application also claims benefit, under 35 U.S.C. § 119(e), of U.S. Provisional Patent
Application No. 60/536,862, filed on January 15, 2004, which is incorporated herein
by reference in its entirety.



1. FIELD OF THE INVENTION 

The field of this invention relates to computer systems and methods for
designing sets of antibody variants and tools for relating the functional properties of
such antibodies to their sequences. These relationships can then be used to determine
the relationship between an antibody's sequence and commercially relevant properties
of that antibody. Such sequence-function relationships may be used to design and
synthesize commercially useful antibody compositions.

2. BACKGROUND OF THE INVENTION 

Because of the immense size of sequence space, there is no effective way to
systematically screen all possible permutations of an antibody for a desired property.
Testing each possible amino acid at each position in an antibody rapidly leads to such
a large number of molecules that no available method of synthesis or testing is
feasible. Furthermore, most molecules generated in such a way would
lack any measurable level of the desired property. Total sequence space is very large
and the functional solutions in this space are sparsely distributed.

Two primary approaches have to date been used to identify antibody
molecules with desired properties: mechanistic and empirical. There are significant
limitations to both of these approaches. The mechanistic approach is often hampered
by insufficient knowledge of the system to be improved, meaning either that
considerable resources must be devoted to characterizing the system (for example by
obtaining high quality protein crystal structures and relating these to the properties of
interest), or that meaningful predictions cannot be made. In contrast, the empirical
approach requires no mechanistic understanding, but relies upon direct measurements
of an antibody's properties to select those variants that are improved. This strength is
also its weakness: large numbers of variants cannot typically be tested under
conditions that are identical to those of the final application. High throughput screens
are widely used to provide surrogate measurements of the properties of interest, but
these are often inadequate. For example, binding of an antibody to an antigen is often
a poor predictor of clinical or diagnostic function.

Empirical engineering of antibodies relies upon creating and testing sets of
variants, then using this information to design and synthesize subsequent sets of
variants that are enriched for components that contribute to the desired activity. A
key limitation for any empirical antibody engineering is in developing a good assay
for antibody function. The assay must measure antibody properties that are relevant
to the final application, but must also be capable of testing a sufficient number of
variants to identify what may be only a small fraction that are actually improved. The
difficulty of creating such an assay is particularly relevant when optimizing antibodies
for complex functions that are difficult to measure in high throughput. Examples
include reduction of viral titer or the killing of tumor cells.

Large numbers of variants cannot typically be tested under conditions that are
identical to those of the final application. High throughput screens are widely used to
provide surrogate measurements of the properties of interest, but these are often
inadequate. For example, binding of an antibody to an antigen in a phage display
assay can have little bearing on its ultimate usefulness as a therapeutic protein.

Limitations in current methods for searching through antibody sequences for
specific commercially relevant functionalities create a need in the art for methods
that can design and synthesize small numbers of variants for functional testing and
that can use the resulting sequence and functional information to design and
synthesize small numbers of variants improved for a desired commercially useful
activity. Limitations in current methods for choosing surrogate screens appropriate
for empirical antibody engineering create a need in the art for methods that can
design and create small numbers of variants that can then be tested for specific
commercially relevant functionalities.
3. SUMMARY OF THE INVENTION

The systems and methods described here apply novel computational biology
and data mining techniques to important molecular design problems. In particular,
novel ways to map antibody sequence space are described. Such maps are used to
direct perturbations or modifications of the antibody sequences in order to perturb or
modify the activity of the antibodies in a controlled fashion.

Methods are disclosed for biological engineering using the design and
synthesis of a set of sequences containing designed substitutions that are statistically
representative of a sequence space, and that contain a high fraction of antibodies
possessing desired properties. In addition to its functionality, each antibody is also
designed to maximize the information that the set of antibodies contains regarding the
contribution of substitutions to the desired antibody properties and the
contributions resulting from interactions between substitutions. This is, in essence, a
map of the sequence space that can also be used to design perturbations to modify the
functionality of the antibody as desired.

The information used to create the substitutions that define the sequence space
can be derived from one or more of (i) multiple sequence alignments, (ii)
phylogenetic reconstructions of ancestral sequences, (iii) analysis of families or
superfamilies of antibodies related by sequence, structure, function or partial function,
(iv) analysis of monomer substitution probabilities within classes of antibody, (v)
three dimensional structures (e.g., molecular models, X-ray crystallographic
structures, nuclear magnetic resonance models, molecular dynamic simulations), (vi)
immunogenic constraints, (vii) prior knowledge about the structure and/or function of
the sequences upon which design of the antibody set is to be based, or (viii) any
similar information pertaining to a related or homologous antibody. In one
embodiment of the invention, this process is automated by use of an expert system
that acquires domain knowledge and captures it in a knowledge database. This
process can provide a score or rank order of substitutions to be incorporated, and a
reasoning based on user specified constraints and domain specific data.

Generally speaking, the first step in the design and manufacture of the
statistically representative sequence sets of this invention is the definition of the initial
sequence space to be searched. This involves defining one or more reference
sequences, identifying positions that are likely to tolerate alteration, and identifying
substitutions at these positions that are likely to be acceptable or to produce desired
changes in the properties of the antibody. All possible combinatorial strings of
polymeric biological molecules define the total defined sequence space to be
searched. Each substitution at each position is typically enumerated in silico and the
acceptability defined computationally. Desirability or acceptability of each possible
substitution is calculated according to one or more criteria. Such calculations can be
performed by a computational system using the knowledge database, user specified
constraints, and/or domain and antibody specific data.

The present invention also provides a more formal systematic method for
selecting substitution positions. The use of a formal system involves quantitative
scores and/or filters for assessing the favorability of substitution positions and the
substitutions possible at those positions. Formalizing the system for substitution
selection allows for the development of an automated system for antibody
optimization or humanization. The parameters, filters and scores can be adjusted
based on data from the scientific literature and data from experiments designed or
interpreted by the automated system. By adjusting the scores and filters, substitutions
that are predicted to be favorable can be aligned with those found experimentally to
be favorable. Continuous refinement of these scores and filters based on experimental
or computational data provides a way for the antibody optimization system to learn
and improve. This formalization and learning capability are an aspect of the
invention.

The second step in the design and manufacture of the statistically
representative sequence sets of this invention is to define a subspace of the total
sequence space to be searched in each iteration of the synthesis, testing and correlating
process. Typically the total allowed space matrix contains lO'-lO" antibodies, many
orders of magnitude more than can be synthesized and measured under commercially
relevant conditions. Such commercially relevant conditions are presently limited to
numbers in the range of 10»-10^. The number of antibody variants that can be
synthesized and tested under appropriate conditions is defined by the availability of
resources. The number of variant positions and the number of substitutions that can
be tested at each of those positions are then calculated, such that each substitution will
be present in a statistically representative fraction of the set of antibodies to be
synthesized. Additionally, when using search methods like Tabu search, ant colony
optimization or similar techniques, the space can be searched on a sequence by
sequence basis by using a memory of the space that has been visited previously and
the properties encountered.
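
The budgeting arithmetic described above can be sketched as follows. This is only an illustration, assuming substitutions are spread evenly over the set; the function names and the `min_occurrences` threshold are hypothetical and not part of the disclosure. The example numbers (24 variants, six substitutions per variant, 24 candidate substitutions) are those of the proteinase K set described for Figs. 13 and 14.

```python
def occurrences_per_substitution(n_variants, subs_per_variant, n_substitutions):
    """Average number of variants carrying each substitution,
    assuming substitutions are distributed evenly across the set."""
    total_slots = n_variants * subs_per_variant
    return total_slots / n_substitutions


def max_substitutions(n_variants, subs_per_variant, min_occurrences):
    """Largest number of distinct substitutions that can each appear
    at least `min_occurrences` times on average (hypothetical threshold)."""
    return (n_variants * subs_per_variant) // min_occurrences


if __name__ == "__main__":
    # 24 variants x 6 substitutions each, spread over 24 candidate substitutions
    print(occurrences_per_substitution(24, 6, 24))
    print(max_substitutions(24, 6, 6))
```

Under these assumptions, raising the required number of occurrences per substitution directly lowers the number of substitutions that a fixed synthesis budget can evaluate in one iteration.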

Typical experimental design methods can introduce more changes in an
antibody than the antibody can tolerate while remaining functional. Adaptations of
these methods, for example by using covering algorithms to reduce the total number of
substitutions in each antibody variant while maximizing the number of different
combinations of pairs of substitutions, are another aspect of the invention.
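
One way to picture such a covering approach is a greedy search that caps the number of substitutions per variant and, for each new variant, keeps the random candidate that adds the most not-yet-covered substitution pairs. The sketch below is a simplified stand-in and not the algorithm of the invention: the function name, the random candidate sampling, and the treatment of substitutions as independent labels (ignoring the constraint that two substitutions cannot occupy the same position) are all assumptions made for illustration.

```python
import itertools
import random


def greedy_pair_design(substitutions, n_variants, subs_per_variant,
                       n_candidates=200, seed=0):
    """Greedily build variants that cover many distinct substitution pairs.

    substitutions: list of substitution labels (treated as independent).
    Returns (design, covered_pairs).
    """
    rng = random.Random(seed)
    covered = set()
    design = []
    for _ in range(n_variants):
        best, best_gain = None, -1
        # Sample random candidates and keep the one adding most new pairs.
        for _ in range(n_candidates):
            cand = tuple(sorted(rng.sample(substitutions, subs_per_variant)))
            gain = len(set(itertools.combinations(cand, 2)) - covered)
            if gain > best_gain:
                best, best_gain = cand, gain
        design.append(best)
        covered |= set(itertools.combinations(best, 2))
    return design, covered


if __name__ == "__main__":
    design, covered = greedy_pair_design(list(range(24)), 24, 6)
    total_pairs = 24 * 23 // 2
    print(len(design), "variants cover", len(covered), "of", total_pairs, "pairs")
```

Each variant carries only `subs_per_variant` changes, so the design trades the completeness of a full factorial for variants that are more likely to remain functional.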

The third step in the design and manufacture of the statistically representative
sequence sets (or sequence sets relevant for specific optimization techniques) of this
invention is to create a set of variant antibodies. This can be performed by
synthesizing the antibody sequences defined and designed in the first two steps. The
systematic design of such variants is one aspect of the present invention. The
antibodies can be synthesized individually, or in a multiplexed set that is subsequently
deconvoluted by sequencing or some other appropriate method. Alternatively, the
antibodies can be created as a library of variants. Many methods have been described
in the art for creating such libraries. See, for example, Stemmer (1994) Proc Natl
Acad Sci U S A 91: 10747-51; Stemmer (1994) Nature 370: 389-91; Crameri et al.
(1996) Nat Med 2: 100-2; Crameri et al. (1998) Nature 391: 288-291; Ness et al.
(1999) Nat Biotechnol 17: 893-896; Volkov et al. (1999) Nucleic Acids Res 27: e18;
Volkov et al. (2000) Methods Enzymol 328: 447-56; Volkov et al. (2000) Methods
Enzymol 328: 456-63; Coco et al. (2001) Nat Biotechnol 19: 354-9; Gibbs et al.
(2001) Gene 271: 13-20; Ninkovic et al. (2001) Biotechniques 30: 530-4, 536; Coco
et al. (2002) Nat Biotechnol 20: 1246-50; Ness et al. (2002) Nat Biotechnol 20:
1251-5; Aguinaldo et al. (2003) Methods Mol Biol 231: 105-10; Coco (2003)
Methods Mol Biol 231: 111-27; and Sun et al. (2003) Biotechniques 34: 278-80, 282,
284 passim. Alternatively, specifically designed antibodies can be synthesized
individually.

After synthesis, the designed set(s) of antibodies are characterized functionally
to measure the properties of interest. This requires developing an assay or
surrogate assay faithful to the property or properties of ultimate interest, and testing
some members of the set of variants for more than one property, including the
property of ultimate interest. Data mining techniques are then employed to
characterize the functions of the variants and to derive a relationship between
antibody sequences and properties. Optionally, the characterization data can be used
to provide information in a subsequent iteration of the method, aiding in the design of
a subsequent set of statistically representative variants that can be synthesized and
tested to obtain a molecule with even more desirable properties. The data from
additional iterations of this process can also be used to refine the data mining
algorithms and models produced from the first set of data. The knowledge created
about the sequence space can in turn be incorporated into the knowledge database for
evaluating the substitutions in the light of this data and recalculating the scores or
rank order of the substitutions. These processes are aspects of the present invention.

Additionally, combinations of the methods described herein can be made with
other techniques such as directed evolution, DNA shuffling, family shuffling and/or
systematic scanning approaches. These can be performed in any order and for any
number of iterations to produce the products described herein. All such combinations
are within the scope of the invention.

4. BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates an overview of the architecture of an Expert System in 
accordance with an embodiment of the present invention. 

Fig. 2 illustrates a flowchart for an antibody engineering method using
integrated information sources to choose initial substitutions, and sequence-activity
relationships to assess them in accordance with an embodiment of the present
invention.

Fig. 3 is a schematic representation of a method for selecting amino acid
substitutions for the optimization or humanization of antibodies in accordance with an
embodiment of the present invention.

Fig. 4 illustrates a method for calculation of weights (e.g., contributions to
activity) for each amino acid substitution in accordance with an embodiment of the
present invention.

Fig. 5 illustrates a method for calculation of weights (e.g., contributions to
activity) for each substitution in accordance with an embodiment of the present
invention. This method provides information about the confidence of each weight by
comparison with weights obtained from randomized data.

Fig. 6 illustrates the amino acid sequence of wild type proteinase K, reported 
by Gunkel et al. (1989) Eur J Biochem 179: 185-194, modified by (i) replacement of 
the fungal leader peptide with an E. coli leader peptide, amino acids -20 to -1 (SEQ 
ID No. 1), and (ii) addition of a histidine tag to the C terminus (amino acids 372-377), 
together with a ValAsp preceding the tag (amino acids 370 and 371) to accommodate 
cloning sites in the nucleic acid sequence. 



Fig. 7 illustrates the nucleotide sequence of proteinase K optimized for
expression in E. coli. The E. coli leader peptide (amino acids -20 to -1 in Fig. 6) is
encoded by nucleotides -60 to -1 in Fig. 7. The proteinase K sequence, beginning
with Ala at amino acid 1 and ending with Ala at amino acid 369, is encoded by
nucleotides 1-1107. The histidine tag, the two additional amino acids described in
Fig. 6 and the termination codon are encoded by nucleotides 1108-1133.

Fig. 8 shows the accession numbers of 49 proteinase K homologs obtained by 
BLAST searching of Genbank. 

Fig. 9 illustrates a distribution of proteinase K homolog sequences (listed in
Fig. 8) in the first two principal components of the sequence space. Sequences 46-49
are derived from thermostable organisms.

Fig. 10 illustrates a corresponding plot of all loads describing the influence of
each variable on the sample distribution of Fig. 9.

Fig. 11 provides magnified detail of the bottom left quadrant from Fig. 10.

Fig. 12 provides principal component analysis-derived loads for individual
amino acids responsible for clustering of thermostable proteinase K homologs.



Fig. 13 illustrates sample output from an Expert System defining the 24 most
highly scoring substitutions to be incorporated into a set of variants for initial
mapping of proteinase K sequence-function space in accordance with an embodiment
of the present invention.

Fig. 14 illustrates a first designed set of 24 variants for proteinase K. Each 
variant contains six substitutions from the wild type sequence. The numbers refer to 
the substitutions identified in Fig. 13. 

Fig. 15 illustrates a second designed set of variants for proteinase K. 

Figs. 16A - 16F illustrate amino acid changes in a set of synthesized
proteinase K variants. Each column shows the changes from the wild type sequence
present in one variant. A blank cell indicates the wild type sequence at that position.
Amino acid numbering is shown in Fig. 6.

Figs. 17A and 17B provide activity measurements of proteinase K variants.
Proteinase K variants were assessed for six different hydrolytic activities. All
activities are normalized to the average performance of the wild type proteinase K. In
Figs. 17A and 17B, y1: hydrolysis of a modified tetrapeptide, N-succinyl-Ala-Ala-Pro-
Leu-p-nitroanilide (AAPL-p-NA), by purified proteinase K variants at pH 7.5; y2:
thermostability ratio (activity after heat treatment / activity without heat treatment,
y6/y1); y4: hydrolysis of a modified tetrapeptide, N-succinyl-Ala-Ala-Pro-Leu-p-nitroanilide
(AAPL-p-NA), by purified proteinase K variants at pH 4.5; y5: hydrolysis of a
modified tetrapeptide, N-succinyl-Ala-Ala-Pro-Leu-p-nitroanilide (AAPL-p-NA), by
purified proteinase K variants at pH 5.5; y6: hydrolysis of a modified tetrapeptide,
N-succinyl-Ala-Ala-Pro-Leu-p-nitroanilide (AAPL-p-NA), at pH 7.5 by purified
proteinase K variants which have been exposed to a heat treatment of 65°C for 5
minutes; and y7: hydrolysis of casein measured as clearing zones, in an LB agar plate
containing 2% skimmed milk, around a bacterial colony expressing the variant.
Duplicate values indicate that a variant's activity was measured on two separate
occasions.

Fig. 18 illustrates a comparison between values predicted and values measured
for a protein sequence-activity model derived from sequences shown in Fig. 16 and
activity data (y6) shown in Fig. 17. Measured activities of proteinase K variants
towards AAPL-p-NA following a five minute 65°C heat treatment on the y-axis
are compared with those predicted by the model on the x-axis. All activities were
measured at 37°C and pH 7.0 using purified protein.

Fig. 19 illustrates the identification of amino acids contributing to a specific
function from a sequence-activity model. Regression coefficients (squares, left axis)
of variant amino acids were derived from the sequence-activity model relating the
sequences of proteinase K sequence variants (with numbers lower than 49) to activity
y6. The number of occurrences of each amino acid substitution is also shown
(diamonds, right axis). Changes from the wild type sequence are circled.

Fig. 20 illustrates the use of sequence-activity modeling to design a new
variant with improved activity. Four amino acid substitutions were found to have
positive regression coefficients in their contribution to activity following heat
treatment (y6). The variant test set contained one variant with one of these changes
(#19) and one with three of these changes (#40). A new variant (#56) was
synthesized to contain all four changes. The graph shows the activity of these
variants towards AAPL-p-NA following five minute 65°C heat treatment. Purified
proteins were heated to 65°C then incubated with AAPL-p-NA at pH 7.5. The
reaction was followed by measuring the absorbance at 405 nm. Alterations from the
wild type sequence are: #19, K208H (filled triangles); #40, V267I, G293A, K332R
(open circles); #56, K208H, V267I, G293A, K332R (filled squares).

Fig. 21 illustrates how different amino acids are important for different
functions in proteinase K. Beneficial amino acid substitutions were calculated by
sequence-activity modeling for three different proteinase K properties. Changes from
the wild type sequence are underlined.

Fig. 22 is a schematic representation of a method for selecting amino acid
substitutions for the optimization of antiviral activity of an antibody in accordance
with an embodiment of the present invention.

Fig. 23 is a schematic representation of a method for selecting amino acid
substitutions for the humanization of an antibody in accordance with an embodiment
of the present invention.

Fig. 24 is a list of germline sequence locus identification numbers obtained 
from VBase (http://www.mrc-cpe.cam.ac.uk ) 

Fig. 25 illustrates a distribution of RSV antibody and antibody sequences 
(listed in Fig. 24) in the first two principal components of the sequence space. 


Fig. 26 illustrates a corresponding plot of all loads describing the influence of
each variable on the sample distribution of Fig. 25.

Fig. 27 provides magnified detail of the right center from Fig. 26.

Fig. 28 provides principal component analysis-derived loads for individual
amino acids responsible for clustering of sequences in group containing the sequence

Fig. 29 is a list of germline sequence locus identification numbers obtained
from VBase (http://www.mrc-cpe.cam.ac.uk ).



Fig. 30 illustrates a distribution of AAF21612 antibody and antibody
sequences (listed in Fig. 29) in the first two principal components of the sequence
space.

Fig. 31 illustrates a corresponding plot of all loads describing the influence of
each variable on the sample distribution of Fig. 30.

Fig. 32 provides magnified detail of the bottom center from Fig. 31.

Fig. 33 provides principal component analysis-derived loads for individual
amino acids responsible for clustering of sequences in group containing the sequence
5-a
5. DETAILED DESCRIPTION OF THE INVENTION

A general antibody humanization and/or maturation scheme is shown in Fig.
2. The steps found in Fig. 2 are briefly introduced here and described in more
detail below.

Step 01. An antibody, or a plurality of antibodies, that partially or fully
achieves the desired property (e.g., function, being humanized and/or matured) is used
as a starting point (step 01).

Step 02. Substitutions to a sequence of step 01 are identified using a
combination of changes to the antibody sequence. Such changes are either in
monomer identity or in monomer physico-chemical properties. These changes span
the CDR and/or the framework region of the heavy chain and/or the light chain of
the antibody. For example, consider the case in which the heavy chain of the
antibody is being humanized. In step 02, a determination can be made that the 21st
and 49th positions of the heavy chain (based on the Kabat numbering scheme) can be
changed. Moreover, in some embodiments, a determination is made as to which
substitutions can be made at such positions in step 02. For instance, step 02 may not
only determine that the 21st position of the antibody can be changed, but may also
determine that this position should be changed to a glycine, alanine, or leucine.

In typical embodiments, several independent rules are used to determine
which positions of the antibodies of step 01 can be changed. Each such rule scores or
ranks individual substitutions based on different methods and based on the nature of
the optimization (i.e., humanization or maturation). Representative rules include, but
are not limited to, rules based on (i) changes found in functional, structural or sequence
classes, (ii) changes predicted to be favorable using substitution matrices, (iii)
changes predicted using evolutionary analysis of the antibody structural and sequence
classes, (iv) changes seen in random mutagenesis screening, (v) changes predicted by
structural modeling, (vi) changes proposed by an expert on the antibody, (vii)
changes predicted to be favorable using structural information, (viii) changes derived
from comparing the framework region of the antibodies with human germline
sequences, (ix) changes derived from comparing the framework regions of human
antibodies, and (x) changes derived from substitution matrices constructed from the
positional frequencies of amino acids in the CDR regions of all antibodies. Any
number of rules can be applied to the one or more antibodies of step 01.

In some embodiments of the present invention, each independent rule assigns
a score for each possible substitution position (e.g., residue) in the antibody of step 01.
The scores generated by each of the rules are then combined by methods and/or filters
to determine the positions in the antibody that are suitable for change. The scores
generated by each of the rules are specific to the nature of the optimization process
(i.e., scores are independently derived for humanization of antibodies and for
maturation of antibodies).
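
The combination of per-rule scores can be pictured with a short sketch. The weighted sum, the `min_score` filter, the rule names and the substitution labels below are all hypothetical choices made for illustration; the text above does not specify how the scores are combined.

```python
def combine_rule_scores(rule_scores, rule_weights, min_score=0.0):
    """Combine per-rule substitution scores into one ranked list.

    rule_scores:  {rule_name: {substitution_label: score}}
    rule_weights: {rule_name: weight}; rules without a weight default to 1.0
    min_score:    simple filter; substitutions scoring below it are dropped
    """
    combined = {}
    for rule, scores in rule_scores.items():
        weight = rule_weights.get(rule, 1.0)
        for substitution, score in scores.items():
            combined[substitution] = combined.get(substitution, 0.0) + weight * score
    kept = [(sub, sc) for sub, sc in combined.items() if sc >= min_score]
    # Highest combined score first; ties broken alphabetically for stability.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical scores from two rules for two substitutions.
    rule_scores = {
        "germline_comparison": {"pos21->Gly": 1.0, "pos49->Ala": 0.5},
        "substitution_matrix": {"pos21->Gly": 0.2, "pos49->Ala": 0.9},
    }
    rule_weights = {"germline_comparison": 2.0, "substitution_matrix": 1.0}
    for substitution, score in combine_rule_scores(rule_scores, rule_weights):
        print(substitution, round(score, 3))
```

Keeping separate weight tables for humanization and for maturation would be one simple way to make the combined scores specific to the optimization being performed.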

Step 03. Step 02 identified a set of candidate substitution positions in the
antibodies of step 01. In step 03, a variant set incorporating such candidate
substitutions is designed such that each candidate substitution is tested in combination
with many different other candidate substitutions in order to cover the possible search
space as evenly as possible (step 03).

To illustrate, consider the case in which the antibody of step 01 is a murine
antibody and the 2nd, 5th, and 15th Kabat positions of the heavy chain have been
identified as candidate substitution positions in step 02. Assuming that each of these
three positions can be independently substituted with any of the twenty naturally
occurring amino acids, there are 20^3 - 1 different variant antibodies that could be
constructed. In some instances, step 02 will constrain the types of amino acids that
can be substituted at these positions based on the rules described above.
Nevertheless, the full antibody sequence space proposed in step 02, even after filtering,
can be large. Step 03 seeks to minimize the number of variants that are constructed in
order to evenly search and sample this large sequence space.
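
The count in this illustration follows from multiplying the number of residues allowed at each position and subtracting the starting sequence itself. A minimal sketch, with a hypothetical function name and an assumed filtered case of four allowed residues per position:

```python
from math import prod


def count_variants(options_per_position):
    """Number of distinct variants, excluding the starting sequence.

    options_per_position: number of residues allowed at each variable
    position, counting the original residue as one of the options.
    """
    return prod(options_per_position) - 1


if __name__ == "__main__":
    # Three positions, any of the twenty naturally occurring amino acids.
    print(count_variants([20, 20, 20]))  # 7999
    # Assumed filtered case: the original residue plus three alternatives.
    print(count_variants([4, 4, 4]))     # 63
```

The count grows multiplicatively with each added position, which is why even a filtered space quickly exceeds what can be synthesized individually.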

Step 04. Variant antibodies selected in step 03 are individually synthesized
and tested for function(s) of interest in step 04. When the variant antibodies are
synthesized individually, it is easier to keep the number of changes and the number of
variants synthesized and tested in each iteration of the process relatively small. In
some embodiments, between 5 and 200, more preferably between 10 and 100, and
even more preferably between 15 and 50 variants are synthesized and tested in step
04. By minimizing the number of variants synthesized and tested, relatively
inaccurate high throughput assay screens can be avoided in step 04.

Step 05. Various machine-learning methods or other data-mining techniques
are used to model the relationship between the sequences and activities of the variant
antibodies in step 05.
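
As one simple example of such a model, an additive least-squares fit assigns each substitution a regression coefficient estimating its contribution to activity. The sketch below is an illustrative stand-in rather than the specific method of the invention: the function name, the small ridge term added for numerical stability, and the toy data are assumptions.

```python
def fit_additive_model(variants, activities, substitutions, ridge=1e-6):
    """Least-squares fit of activity = b0 + sum_j w_j * x_j, x_j in {0, 1}.

    variants:      list of sets of substitution labels present in each variant
    activities:    measured activity for each variant
    substitutions: ordered list of substitution labels to model
    Returns (intercept, {substitution: coefficient}).
    """
    # Design matrix: intercept column plus one 0/1 column per substitution.
    X = [[1.0] + [1.0 if s in v else 0.0 for s in substitutions] for v in variants]
    p = len(substitutions) + 1
    # Normal equations (X^T X + ridge * I) w = X^T y.
    A = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * y for row, y in zip(X, activities)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * p
    for i in reversed(range(p)):
        w[i] = (b[i] - sum(A[i][c] * w[c] for c in range(i + 1, p))) / A[i][i]
    return w[0], dict(zip(substitutions, w[1:]))


if __name__ == "__main__":
    # Toy data generated from activity = 1.0 + 0.5*A - 0.3*B + 0.0*C.
    variants = [set(), {"A"}, {"B"}, {"C"}, {"A", "B"}, {"B", "C"}]
    activities = [1.0, 1.5, 0.7, 1.0, 1.2, 0.7]
    intercept, weights = fit_additive_model(variants, activities, ["A", "B", "C"])
    print(round(intercept, 3), {k: round(v, 3) for k, v in weights.items()})
```

Substitutions with positive coefficients are candidates to combine in the next set of variants; a purely additive model ignores interactions between substitutions, which other models could capture.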

Step 06. The assessments of the effect of each substitution upon the properties
(functions) of the antibodies by each model tested in step 05 are combined in step 06.

Step 07. The assessments of the effect of each substitution upon the properties
(functions) of the antibodies by each tested model, as combined in step 06, are used in
step 07 to design a new set of variant antibodies for synthesis and testing.

Repeating steps 04 - 07. Steps 04 through 07 are repeated a number of times.
Each iteration of steps 04-07 seeks to design a set of high scoring and diverse
antibodies for synthesis and functional testing. Each new set of measurements from
an iteration of step 04 is used to refine the sequence-activity model until an end point
is reached, at which point the method progresses to step 08.

Step 08. The performance of the methods used to select substitution positions
in step 02 and to model the sequence-activity relationships in instances of step 05 is
assessed by analyzing the sequences of the best performing variants. In general, the
best performing variants are any variants in any iteration of the cycle defined by steps
04-07 that score best in one or more functional assays for the target antibody. Step 08
provides a method for tuning the adjustable parameters of the system. Once these
parameters have been adjusted, steps 02 through 07, including multiple iterations of
the cycle defined by steps 04-07, are repeated. Advantageously, one of the adjustable
parameters of the system is the individual weights for each of the methods applied in
step 02. For example, those step 02 methods that were good at identifying substitution
positions associated with high scoring antibody variants are up-weighted in the next
instance of steps 02 through 07. The modification of weights applied to methods in
step 02 based on the results of cycles of steps 04-07 allows the system to learn from
previous results, thereby improving the accuracy with which the system can identify
beneficial substitutions (in step 02) and assess the contribution of substitutions to
antibody activity (in steps 05 and 06).
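
The up-weighting described for step 08 might look like the following sketch. The multiplicative update, the `learning_rate` parameter, the hit-rate measure and the final normalization are hypothetical choices; the text states only that methods that identified substitutions found in high scoring variants are up-weighted. The sketch assumes all starting weights are positive.

```python
def update_rule_weights(rule_weights, rule_proposals, best_variants,
                        learning_rate=0.5):
    """Up-weight step 02 rules whose proposals appear in the best variants.

    rule_weights:   {rule_name: current positive weight}
    rule_proposals: {rule_name: set of substitutions the rule proposed}
    best_variants:  list of sets of substitutions found in top variants
    Returns new weights normalized to sum to 1.
    """
    best_subs = set().union(*best_variants) if best_variants else set()
    updated = {}
    for rule, weight in rule_weights.items():
        proposed = rule_proposals.get(rule, set())
        # Fraction of this rule's proposals that occur in a best variant.
        hit_rate = len(proposed & best_subs) / len(proposed) if proposed else 0.0
        updated[rule] = weight * (1.0 + learning_rate * hit_rate)
    total = sum(updated.values())
    return {rule: weight / total for rule, weight in updated.items()}


if __name__ == "__main__":
    weights = {"rule_a": 1.0, "rule_b": 1.0}
    proposals = {"rule_a": {"x", "y"}, "rule_b": {"z"}}
    print(update_rule_weights(weights, proposals, [{"x", "y"}]))
```

The updated weights can then be supplied to the next instance of step 02, so that rules with a better track record have more influence on which substitutions are selected.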

5.1 EXPERT SYSTEMS FOR DEFINING A SEQUENCE SPACE

Fig. 1 details an exemplary system that supports the functionality described
above. The system is preferably a computer system 10 having:

• a central processing unit 22;

• a main non-volatile storage unit 14, for example a hard disk drive, for
storing software and data, the storage unit 14 controlled by storage controller 12;

• a system memory 36, preferably high speed random-access memory
(RAM), for storing system control programs, data, and application programs,
comprising programs and data loaded from non-volatile storage unit 14; system
memory 36 may also include read-only memory (ROM);

• a user interface 32, comprising one or more input devices (e.g.,
keyboard 28) and a display 26 or other output device;

• a network interface card 20 for connecting to any wired or wireless
communication network 34 (e.g., a wide area network such as the Internet);

• an internal bus 30 for interconnecting the aforementioned elements of
the system; and

• a power source 24 to power the aforementioned elements.

Operation of computer 10 is controlled primarily by operating system 40,
which is executed by central processing unit 22. Operating system 40 can be stored in
system memory 36. In a typical implementation, system memory 36 includes:

• operating system 40;

• file system 42 for controlling access to the various files and data
structures used by the present invention;

• a user interface 104;

• an expert system 100;

• case-specific data 110; and

• knowledge base 108.
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As illustrated in Fig. 1, computer 10 comprises case-specific data 1 10 and 
knowledge base 108. Case-specific data 1 10 and knowledge base 1 08 each 
independently comprise any form of data storage system including, but not limited to, 
a flat file, a relational database (SQL), and an on-line analytical processing (OLAP) 
database (MDX and/or variants thereof). In some specific embodiments, case-specific 
data 110 and/or knowledge base 108 is a hierarchical OLAP cube. In some specific 
embodiments, case-specific data 1 10 and/or knowledge base 108 comprises a star 
schema that is not stored as a cube but has dimension tables that defme hierarchy. In 
some embodiments, case-specific data 1 10 and/or knowledge base 108 is respectively 
a single database. In other embodiments, case-specific data 110 and/or knowledge 
base 108 in feet comprises a plurality of databases that may or may not all be hosted 
by the same computer 10. In such embodiments, some component databases of case- 
specific data 1 10 and/or knowledge base 108 are stored on one or more computer 
systems that are not illustrated by Fig. 1 but that are addressable by wide area network 
34. 

It will be appreciated that many of the modules illustrated in Fig. 1 can be 
located on one or more remote computers. For example, some embodiments of the 
present application are accessible in web service-type implementations. In such 
embodiments, user interface module 104 and other modules can reside on a client 
computer that is in communication with computer 10 via network 34. In some 
embodiments, for example, user interface 104 can be an interactive web page. 

In some embodiments, the case-specific data 110 and/or knowledge base 108 
and modules (e.g., modules 100, 104, 112, 106, 116, 114, 118, 130, 132) illustrated in 
Fig. 1 are on a single computer (computer 10) and in other embodiments such data is 
hosted by several computers (not shown). Any arrangement of case-specific data 110 
and knowledge base 108 and the modules illustrated in Fig. 1 on one or more 
computers is within the scope of the present invention so long as these components 
are addressable with respect to each other across network 34 or by other electronic 
means. Thus, the present invention fully encompasses a broad array of computer 
systems. 

Now that an overview of a computer system and the data structures stored in 
such a computer system has been presented, more details on the inventive data 
structures and software modules of the present invention will be described. 






wo 2005/012877 PCT/US2004/024751 

Expert system 100 is a software module that includes stored knowledge and 
solves problems in a specific field (for example antibody engineering) by emulating 
some of the decision processes of a human expert(s). The first set of algorithms that 
chooses the substitutions and the sequence space to explore for antibody engineering 
(steps 02 and 03 of Fig. 2) may require expertise in the domains of polynucleotide 
structure and function, antibody structure and function, protein structural analysis and 
interpretation, protein structure and function, protein and nucleic acid phylogeny and 
evolution, chemical and enzymatic mechanisms, bioinformatics and related fields. 
Expert system 100 applies the knowledge to problems specified by a user who is not 
necessarily an expert in the domain(s). This invention describes the construction and 
use of expert system 100 for selecting substitutions useful for mapping and 
engineering antibody functions. 

Two functions expert system 100 provides in order to define a sequence space 
to search are (i) the identification of one or more positions in the antibody at which 
substitution is likely to be accepted and where at least some substitutions, insertions, 
deletions or modifications are likely to result in a functional antibody and (ii) the 
identification of residues or modifications that are likely to result in a functional 
antibody when used to substitute or insert at each of the one or more positions 
identified in (i). An additional or alternative purpose of expert system 100 is the 
identification of residues or modifications that are likely to affect the desired 
properties or functions of the antibody. These functions are represented as step 02 in 
Fig. 2. 

One aspect of this invention is the use of methods to identify positions that 
can be varied, then to synthesize a set of antibody variants containing these 
substitutions and to test the antibodies for one or more properties or functions, with the 
aim of deriving relationships between antibody sequence and function. 

A user can interact with expert system 100 using user interface 104. In some 
embodiments, user interface 104 comprises menus, natural language or any other style 
of interaction. Expert system 100 uses inference engine 106 to reason using the 
expert knowledge stored in knowledge base 108 together with case-specific data 
110 relating to the specific antibody or class of antibodies to be mapped and/or 
engineered. Case-specific data 110 can be acquired as input from the user of expert 
system 100, presented in knowledge base 108, or acquired from case-specific 
knowledge generated by the results of experimentation and the analysis facilitated by 
sequence-activity correlating methods of this invention described in further detail 
below. These sequence-activity correlating methods are performed in step 05 of Fig. 
2, for example. The data from these sequence-activity correlating methods can 
additionally be used to add to or alter the information contained within knowledge 
base 108. 

Expert knowledge will typically be stored in knowledge base 108 in the form 
of a set of rules 120. An exemplary rule 120 is: 

    IF (an antibody protein has known variants that possess some activity) 
    THEN { 
        assign probabilities for incorporating the variant residues 
        based on their occurrence in some set of other naturally 
        occurring antibodies and/or synthetically derived antibodies 
        using a substitution matrix to determine the likelihood of such 
        a substitution occurring in nature 
    } 

Another exemplary rule 120 is: 

    IF (desired activity is binding affinity) 
    THEN { 
        change weights used to score/rank the substitutions found in known 
        antibody classes that bind to the desired target 
    } 

Additional examples of rules 120 are each of the filters described in Figs. 4 and 5. 
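
By way of illustration only, an IF/THEN rule 120 of the kind shown above can be pictured as a condition paired with an action. The following Python sketch is not the implementation of expert system 100; the names Rule, apply_rules and boost_known_binders, the dictionary keys, and the factor 2.0 are all hypothetical and chosen only to make the structure concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins: 'Case' plays the role of case-specific data 110,
# and 'Weights' plays the role of the weights that a rule may adjust.
Case = Dict[str, object]
Weights = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Case], bool]          # the IF part
    action: Callable[[Case, Weights], None]    # the THEN part


def apply_rules(rules: List[Rule], case: Case, weights: Weights) -> List[str]:
    """Run every rule whose condition holds; return the names that fired."""
    fired = []
    for rule in rules:
        if rule.condition(case):
            rule.action(case, weights)
            fired.append(rule.name)
    return fired


def boost_known_binders(case: Case, weights: Weights) -> None:
    # Illustrative action: raise the weight given to substitutions found in
    # antibody classes known to bind the target. The factor is arbitrary.
    weights["known_binder_substitutions"] *= 2.0


rules = [
    Rule(
        name="binding_affinity",
        condition=lambda case: case.get("desired_activity") == "binding affinity",
        action=boost_known_binders,
    )
]

weights = {"known_binder_substitutions": 1.0, "substitution_matrix": 1.0}
fired = apply_rules(rules, {"desired_activity": "binding affinity"}, weights)
```

Keeping the condition and the action as separate callables means a rule can be added, removed or edited without touching the loop that evaluates it.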

Case-specific data 110 can be precompiled by experts. It can also be obtained 
as user response to questions contained in a component of expert system 100, for 
example user interface 104, knowledge base 108 or inference engine 106. 

The functionality relied on by rules 120 of expert system 100 can also be 
obtained, in part, by a set of automatic actions executed using one or more 
computational processes 118. An example of a computational process 118 is: 

    Upon input of a target sequence (from Step 01) { 

        202 Search one or more sequence databases for homologs of 
        the target antibody sequence. Store any such sequences in knowledge 
        base 108. 

        204 Identify any functional information provided for any of 
        these target antibody sequences by any of these databases. Store any 
        such functional information in knowledge base 108. 

        206 Search one or more structure databases for homologs of 
        the target antibody sequence. Store any such homolog structural 
        information in knowledge base 108. 

        208 Search one or more databases for known variants of the 
        target antibody sequence. Store any sequence and functional 
        information in knowledge base 108. 

        210 Compute the scores for every enumerated substitution 
        found in steps 202 through 208 using select rules 120. 
    } 
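
The gather, store and score sequence of steps 202 through 210 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the search functions are injected as plain callables because no particular database client is specified, and the function name run_process, the record layout, the dictionary keys and the toy sequence and substitutions are all invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical search callables: each takes the target sequence and returns
# a list of records. Real database access is deliberately not modeled.
Search = Callable[[str], List[dict]]


def run_process(target: str, searches: Dict[str, Search],
                score: Callable[[dict], float]) -> Dict[str, list]:
    """Gather records, store them, then score every enumerated substitution."""
    knowledge_base: Dict[str, list] = {}
    # Steps 202-208: run each search and store its results under its own key.
    for key, search in searches.items():
        knowledge_base[key] = search(target)
    # Step 210: score each substitution found by the searches.
    knowledge_base["scores"] = [
        (record["substitution"], score(record))
        for records in list(knowledge_base.values())
        for record in records
        if "substitution" in record
    ]
    return knowledge_base


# Placeholder searches returning made-up records.
searches = {
    "sequence_homologs": lambda seq: [{"substitution": "I46V", "source": "homolog"}],
    "known_variants": lambda seq: [{"substitution": "A60S", "source": "variant"}],
}
kb = run_process("EVQLVESGG", searches, score=lambda record: 1.0)
```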

Computational processes 118 can be stored in knowledge base 108 as 
illustrated in Fig. 1, in expert system 100, or in any data structure that is accessible by 
expert system 100. Some embodiments of expert system 100 include explanation 
subsystem 112. Explanation subsystem 112 provides reasons to the user for why 
particular substitutions are selected by rules 120. Some embodiments of expert 
system 100 include knowledge base editor 114 to allow an administrator to add, 
delete, or modify components of knowledge base 108 including, but not limited to, 
rules 120. 

In some embodiments, expert system 100 provides scores for each substitution 
enumerated along with the contribution to that score from various methods 130 used 
to evaluate the desirability of each substitution. The weights 132 for the various 
methods 130 are derived from knowledge base 108 and can be updated by an expert 
using knowledge base editor 114 and can also be updated automatically using rules in 
knowledge base 108. 

Inference engine 106 is a software module that reasons using information 
stored in knowledge base 108. One embodiment of inference engine 106 is a 
rule-based system. Rule-based systems typically implement forward or backward chaining 
strategies. Inference engine 106 can be goal driven, using backward chaining to test 
whether some hypothesis is true, or data driven, using forward chaining to draw new 
conclusions from existing data. Various embodiments of expert system 100 can use 
either or both strategies. For example, some topics that can be posed by expert 
system 100 in a goal-driven/backward chaining strategy can include: (i) how 
conservative should an approach be, (ii) how many iterations of the process are likely 
to achieve the activity of interest, (iii) by what factor should the desired activity 
increase, and (iv) descriptions of any prior experiments that have failed and why they 
have failed. Answers to these topics allow expert system 100 to access information 
from experiments and data from the scientific literature or from personal 
communications that can be relevant for the design of the sequence space of interest. 
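
The two chaining strategies can be contrasted with a small Python sketch. It assumes a deliberately simple rule form (a set of premises implying one conclusion) and invented fact names; it is meant to show the difference between data-driven and goal-driven reasoning, not how inference engine 106 is actually built.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion): if every premise is a known fact,
# the conclusion becomes a known fact. Fact names are illustrative.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Data-driven inference: fire rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal: str, facts: Set[str], rules: List[Rule],
                   _seen: FrozenSet[str] = frozenset()) -> bool:
    """Goal-driven inference: is the goal a fact, or derivable from rules?"""
    if goal in facts:
        return True
    if goal in _seen:          # guard against circular rules
        return False
    seen = _seen | {goal}
    return any(
        all(backward_chain(p, facts, rules, seen) for p in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )


rules = [
    (frozenset({"position_variable_in_homologs"}), "position_tolerates_substitution"),
    (frozenset({"position_tolerates_substitution", "near_cdr"}), "candidate_position"),
]
facts = {"position_variable_in_homologs", "near_cdr"}
derived = forward_chain(facts, rules)
```

Forward chaining returns everything that follows from the facts, whereas backward chaining answers a single question and only explores the rules that bear on it.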

Inference engine 106 can calculate a probability that a variant residue will 
provide a desired activity in an antibody of interest. The antibody can be an Fab 
fragment, a (Fab)2 fragment, a scFv fragment, a polynucleotide having its own activity 
of interest, a polynucleotide that encodes an antibody having an activity of interest, or 
a polynucleotide that encodes a polypeptide that is responsible for synthesis of an 
antibody having an activity of interest. 

A profile 116 can be created by inference engine 106 based on probability 
scores and weighting factors. In some embodiments, inference engine 106 calculates 
the probability that defined substitutions will result in an antibody having the desired 
function, for any variant of the reference antibody. For example, in some instances, 
knowledge base 108 can contain information describing residue positions in the 
reference sequence that exhibit a high degree of variance in homologs or among 
sequences in the same structural or sequence class. Inference engine 106 may thus 
give a high probability that substitutions at such positions will be active. One method 
of calculating the degree of amino acid variance is described by Gribskov, 1987, Proc 
Natl Acad Sci USA 84, 4355. As another example, in some instances, a sequence 
alignment can be available in knowledge base 108 to serve as the basis of a Hidden 
Markov model that can be used to calculate the probability that one specific residue 
will be followed by a second specific residue. These models also include probabilities 
for gaps and insertions. See, Krogh, "An introduction to Hidden Markov models for 
biological sequences," in Computational Methods in Molecular Biology, Salzberg et 
al., eds, Elsevier, Amsterdam. Such models can be used by inference engine 106 to 
calculate the probability that a particular substitution will possess a desired function. 
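
As a simple illustration of measuring per-position variance in an alignment, the Python sketch below computes the Shannon entropy of each alignment column. Entropy is used here only as a convenient, widely known proxy; it is an assumption of this sketch and is not the profile method of Gribskov cited above. The toy alignment is invented.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]   # ignore gap characters
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def position_variability(alignment: List[str]) -> List[float]:
    """Entropy per column for equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy("".join(col)) for col in zip(*alignment)]


# Toy alignment, invented for illustration only.
alignment = ["QVQL", "QVKL", "EVQL", "QVRL"]
variability = position_variability(alignment)
```

Columns with higher entropy show more residue variety across the aligned sequences and, under the reasoning in the text, would be given a higher probability of tolerating substitution.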

In some embodiments of the present invention, a variety of different 
substitution matrices 122 stored in knowledge base 108 can be used by expert system 
100 to identify suitable replacement residues for positions likely to accept 
substitutions. Specifically, substitutions specific for antibody framework regions and 
antibody CDR regions can be generated using the sequences in the database. 
Additionally, substitutions based on the amino acid frequencies compiled for every 
CDR position for every antibody class in the Kabat database can be derived. In 
addition, the availability of a replacement residue that is likely to be functional can 
itself determine whether or not a position is likely to accept substitutions. This can be 
generated from functional sequences that are naturally occurring and/or generated 
synthetically whose properties have been measured. Substitution matrix 122 choices 
will impact the probability calculated for likely functionality of a variant. Thus, if 
mutations based on sequence alignment are desired, a substitution matrix 122 derived 
from the set of sequences should be chosen. Alternatively, if mutations that depend 
on general mutability are desired, a substitution matrix 122 reflecting this need should 
be chosen. Substitution matrices 122 can be calculated based on the environment of a 
residue, e.g., inside or accessible, in coil or in beta-sheet. See, for example, 
Overington et al., 1992, Protein Sci 1:216. 
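
The idea of choosing a substitution matrix 122 according to the environment of a residue can be sketched in Python as a lookup keyed by environment. The matrix values below are hand-made placeholders with no biological authority, and the names MATRICES and rank_replacements are hypothetical; real matrices would be derived from aligned sequences or structures.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-typed scores for illustration only.
Matrix = Dict[Tuple[str, str], float]

MATRICES: Dict[str, Matrix] = {
    "buried":  {("I", "V"): 3.0, ("I", "L"): 2.0, ("I", "D"): -4.0},
    "exposed": {("I", "V"): 1.0, ("I", "L"): 1.0, ("I", "D"): -1.0},
}


def rank_replacements(residue: str, environment: str) -> List[Tuple[str, float]]:
    """Rank candidate replacements for `residue` using the matrix that
    matches its structural environment, best score first."""
    matrix = MATRICES[environment]
    candidates = [(new, s) for (old, new), s in matrix.items() if old == residue]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


ranked = rank_replacements("I", "buried")
```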

Methods to identify solvent accessible residues and to compute their solvent 
availability are known in the art. See, for example, Hubbard, Protein Eng 1:159 
(1987). Such calculated solvent availability can be used to determine which 
substitution matrix 122 is used. More complex substitution matrices 122 that consider 
secondary structure, solvent accessibility, and the residue chemistry are also suitable 
for use in probability matrices. See, for example, Bowie & Eisenberg, Nature 356:83 
(1992). 

Conservation indices 124 stored in knowledge base 108 can also be used by 
inference engine 106 to calculate probabilities that a substitution will result in an 
antibody with desired properties. In this capacity, one can avoid mutating residues 
that are highly conserved, or conversely, focus mutations on conserved regions of the 
antibody. Algorithms for calculating conservation indices 124 at each position in a 
multiple sequence alignment are known in the art. See, for example, Novere et al., 
1999 Biophys. Journal 76:2329-2345. 

Inference engine 106 can also use knowledge of the effects of single mutations 
as a factor in calculating the probability that a substitution will possess a desired 
function when mutation effect data 126 is stored in knowledge base 108. Mutation 
effect data 126 can originate, for example, from mutagenesis scans or from those 
substitutions found in naturally occurring variants that affect the function of interest. 

Inference engine 106 can also use structural information 128 (e.g., crystal 
structure, in silico models of antibodies, de novo modeled antibody, etc.) stored in 
knowledge base 108. For example, inference engine 106 can assign higher 
probabilities to amino acid residues in framework regions that are close to the CDR of 
an antibody, since these are more likely to affect activity and/or specificity than more 
distant residues. Similarly, proximity to an epitope, proximity to an area of structural 
conflict, proximity to a conserved sequence, proximity to a binding site, proximity to a 
cleft in the protein, proximity to a modification site, etc. can be calculated from structural 
information 128 and used to calculate the probability that a substitution will result in a 
functional antibody. To calculate the distance of a residue from a region of functional 
interest, physical distances obtained using a known crystal structure of the reference 
sequence can be used. Alternatively, molecular modeling approaches can be used. 
For example, the structure of the reference sequence can be predicted based on its 
homology to a known structure, and then used to calculate distances. Or the entire 
structure of the reference sequence can be predicted and distances then calculated 
from the predicted structure. 
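
The distance calculation described above can be sketched in Python once coordinates are available, whether from a crystal structure or a model. The sketch assumes one representative point per residue, an illustrative cutoff of 5 Å, and invented coordinates and residue labels; none of these choices is prescribed by the text.

```python
import math
from typing import Dict, Iterable, Tuple

Coord = Tuple[float, float, float]


def min_distance_to_region(residue: Coord, region: Iterable[Coord]) -> float:
    """Smallest Euclidean distance (in Å) from one residue to any region point."""
    return min(math.dist(residue, other) for other in region)


def proximity_scores(framework: Dict[str, Coord], cdr: Iterable[Coord],
                     cutoff: float = 5.0) -> Dict[str, int]:
    """Score 1 for framework residues within `cutoff` Å of the CDR, else 0."""
    cdr = list(cdr)
    return {
        name: int(min_distance_to_region(xyz, cdr) <= cutoff)
        for name, xyz in framework.items()
    }


# Invented coordinates, one representative point per residue.
cdr_points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
framework = {"L46": (0.0, 4.0, 0.0), "L70": (0.0, 12.0, 5.0)}
scores = proximity_scores(framework, cdr_points)
```

The same function can be reused for proximity to an epitope, a binding site or a modification site by passing a different set of region coordinates.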

In some embodiments structural information 128 is energy minimized. For 
example, the behavior of an antibody can be modeled using molecular dynamic 
simulations. In a specific example, a crystal structure or a predicted structure can be 
subjected to molecular dynamic simulation in order to model the effect of various 
external conditions such as the presence of solvent, the effect of temperature and ionic 
strength, upon the determined or predicted structure. 

In addition to the examples of elements of information that can be used as a 
part of a knowledge base 108 described above, other information that can contribute 
to an antibody knowledge base 108 that can then be used by inference engine 106 of 
an expert system 100 to calculate the probability that a substitution will possess a 
desired function includes, but is not limited to, individual sequence analysis 
(including sequence complexity, sequence content and composition, internal 
base-pairing and secondary structure predictions), sequence comparisons (including 
structure-based sequence alignments, homology-based sequence alignments, 
phylogenetic comparisons based on multiple pairwise comparisons, phylogenetic 
comparisons based on principal component analysis of sequence alignments, Hidden 
Markov models), evolutionary molecular analysis, structural analysis (including those 
using X-ray crystallographic data, nuclear magnetic resonance studies, structure 
threading algorithms, molecular dynamic simulations, active site geometry, 
determination of surface, internal and active site residues), known or predicted data 
relating sequence or structure to functional mechanisms, chemical and biophysical 
properties of functional groups, known or predicted functional effects of changes (for 
example information derived from the Protein Mutant Resource database, from an 
evolutionary comparison of sequence and activity data or from a comparison of binding 
pockets and residues for the antibody with binding pockets and residues for other 
antibodies or sets of antibodies), substitution matrices derived from sequence 
comparisons, mutations that are known or that can be predicted to affect physical 
properties of proteins (including stability, thermostability), known or predicted 
properties (including plasticity and tolerance to substitutions) of homologous or 
related antibodies (including other members of sequence, structurally or functionally 
related classes of antibodies), known or predicted immunological effects and 
constraints for specific sequence residues or motifs, known or predicted sequence 
effects on in vivo or in vitro post-translational or post-transcriptional modifications, 
known or predicted effects of the functional environment (including other proteins, 
nucleic acids or other molecules contained within a cell), measured or predicted 
biochemical or biophysical properties (including crystallization), effects of sequences 
on the expression of nucleic acids or proteins (including known or predicted RNA 
splice sites, protein splice sites, promoter sequences, transcriptional enhancer 
sequences, transcription and translation terminator sequences, sequences that affect 
the stability of a protein or nucleic acid, codon usage tables, nucleic acid GC content). 
Sources of this information can include, without limitation, text mined from scientific 
literature, data mined from genomic sequences, expressed sequences, structural 
databases and, in second and subsequent iterations of the process, case-specific data 
from the first points of the sequence space mapped. 

In some embodiments of the present invention, knowledge base 108 is 
optionally preprocessed for information by knowledge base editor 114. For example, 
knowledge base 108 can contain all available antibody sequences. During 
preprocessing by knowledge base editor 114, such sequences can be, for example, (i) 
aligned and distributed on a phylogenetic tree, (ii) grouped by principal component 
analysis (PCA), (iii) grouped by nonlinear component analysis (NLCA), (iv) grouped 
by independent component analysis (ICA), used to create sequence profiles (see, for 
example, Gribskov, 1987, Proc Natl Acad Sci USA 84, 4355), (v) used to create 
Hidden Markov models, (vi) used to calculate structures prior to interrogation by 
the user, or (vii) classified into canonical structural classes as defined by Chothia and Lesk 
(REF). PCA, NLCA, and ICA are described in, for example, Duda et al., Pattern 
Classification, Second Edition, John Wiley & Sons, Section 10.13, which is hereby 
incorporated by reference. 

In one embodiment, the output from an expert system 100 will describe the 
various substitutions recommended by methods 130 based on assignment of scores, 
confidences, ranks, or probabilities (hereinafter "scores") using rules 120 in 
knowledge base 108. In preferred embodiments, these scores are cumulative. That is, 
every rule 120 used by a method 130 will assign a score to the substitution under 
consideration and these scores can be higher if more rules are satisfied. 

For example, Fig. 3 shows a series of steps that can be executed by expert 
system 100 in order to identify substitutions that are likely to increase the ability of an 
antibody to bind to a specific target antigen. Five independent methods 130 are 
shown for assessing the suitability of a substitution in the framework and CDRs: (i) 
substitutions from antibody sequences derived from other species and/or from 
synthetically derived antibodies and/or germline sequences from human and/or other 
species, (ii) substitutions from homologous and modeled structures, (iii) substitutions 
from substitution matrices, (iv) substitutions from principal component analysis 
(PCA) and (v) substitutions from binding pocket analysis. For each method 130, one 
or more rules (filters) 120 defined in knowledge base 108 are used. For example, 
method (ii), substitutions from homologous structures, uses two rules 120. The first 
rule 120 estimates the mean root mean square deviation (RMSD) from the 
target structure for every five residue window of the homolog structure, and selects 
framework sites that deviate from the target structure by more than three Å. The 
second rule 120 identifies amino acid substitutions that are found in homologous 
sequences and selects framework sites that are within five Å of the complementarity 
determining region. In Fig. 3 rules 120 are applied as filters: a substitution that 
satisfies one of the rules is considered to have passed through that filter and receives a 
score. For example, in Fig. 3, this score is 1. The rules 120 used (applied) by the four 
other methods 130 for assessing the suitability of substitutions shown in Fig. 3 are 
also applied as filters. The score for each method can then be combined, for example 
by summing them. All possible substitutions can then be ranked in order of their 
cumulative scores. Although there are many variants, in some embodiments of the 
present invention, a component of step 02 of Fig. 2 uses the following algorithm in 
order to identify suitable substitutions: 

    for each residue position j of the antibody identified in step 01 
    { 
        for each possible substitution k of residue j 
        { 
            initialize score_jk; 
            for each method m (method 130) in a suite of methods 
            { 
                initialize score_m 
                for each filter n (rule 120) in method m 
                { 
                    compute filter n based on substitution k at position j; 
                    score_m = score_m + result of filter n; 
                } 
                score_jk = score_jk + score_m 
            } 
        } 
    } 
    rank all scores_jk 

Those substitutions that have satisfied more of the rules will have been 
assigned higher cumulative scores (score_jk), and those with the highest scores will be 
selected for incorporation into a set of antibody variants. 
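
The additive algorithm above translates directly into Python. The sketch below is one possible rendering, assuming that each filter is a function of the position and the replacement residue returning a number; the two toy filters and the candidate list are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Substitution = Tuple[int, str]                 # (position j, replacement k)
Filter = Callable[[int, str], float]           # a rule 120 applied as a filter
Method = List[Filter]                          # a method 130 is a list of filters


def score_substitutions(candidates: List[Substitution],
                        methods: List[Method]) -> List[Tuple[Substitution, float]]:
    """Sum filter results within each method, then sum across methods."""
    scores: Dict[Substitution, float] = {}
    for j, k in candidates:
        score_jk = 0.0
        for method in methods:
            score_m = 0.0
            for rule in method:
                score_m += rule(j, k)
            score_jk += score_m
        scores[(j, k)] = score_jk
    # Rank from highest to lowest cumulative score.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Two toy filters that return 1 when passed and 0 otherwise.
near_cdr = lambda j, k: 1.0 if j in {46, 47} else 0.0
conservative = lambda j, k: 1.0 if k in {"V", "L"} else 0.0

ranking = score_substitutions(
    [(46, "V"), (46, "D"), (70, "L"), (70, "D")],
    [[near_cdr], [conservative]],
)
```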

There are many variations of ways to combine scores produced by two or 
more rules 120. Variations are possible (i) in the methods of assigning scores, (ii) in 
the methods of combining scores, and (iii) in the methods of assigning different 
weights to scores produced by different rules 120. Rules 120 can also be combined 
on a case by case basis, using expert knowledge. These rules 120 can be stored in a 
knowledge base 108 and can be executed by inference engine 106 using user input 
acquired by questioning the user for requirements and knowledge via the user 
interface 104. 



5.1.1 Variations in the Method of Assigning Scores 

In preferred embodiments, each rule 120 produces a reproducible quantitative 
value that can be used as a measure of the suitability of a substitution. However, there 
are many different ways in which quantitative scores can be obtained, and these ways 
can differ between different rules 120. A rule 120 can be used to produce an absolute 
quantitative score. This absolute quantitative score can be used directly, or it can be 
used to create a rank order list or a filter. As an example consider rule 1b of Fig. 3. 
Rule 1b calculates the difference in free energy between a target antibody and an 
antibody containing a substitution. This value can then be used in several different 
ways to compare the favorability of different substitutions. For example, (i) the 
absolute value of the free energy difference (caused by the substitution) can be used, 
(ii) the free energy differences of all possible substitutions can be ranked in order of 
favorability, then a subset of substitutions that are predicted to be the most favorable 
can be selected and assigned a score, (iii) the score can be a single value assigned to 
all of the substitutions belonging to the subset of the most favorable, (iv) the score can 
be a measure of the rank order of the substitution, so that the most favorable 
substitutions receive a higher score than those that are calculated to be less favorable, 
(v) a rule can also be used to rank all possible substitutions in order of predicted 
favorability and then eliminate a subset of these substitutions that are predicted to be 
the least favorable. In option (v), substitutions that were eliminated would receive a 
score of zero. 

A way in which the predicted free energy change of a substitution can be used 
as a rule to obtain quantitative measures of the favorability of a substitution has been 
described. An absolute quantitative value obtained by any method for favorability 
can also be transformed by use of a function. In the case of free energy change, 
instead of using the free energy change itself, the exp(free energy change) or step 
functions that can reflect (iii) above can be used. One of skill in the art will appreciate 
that there are other rules that can be applied to assess the effect of a substitution in 
order to produce absolute quantitative scores and all such other rules are included 
within the scope of the present invention. 
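
The different ways of turning a predicted free energy change into a score can be compared side by side in Python. This is a sketch under stated assumptions: the free energy values and substitution labels are invented, more negative values are taken to mean more favorable, and the exponential transform uses exp(-value) with no temperature factor, which is a simplifying sign and scale convention rather than anything fixed by the text.

```python
import math
from typing import Dict

# Hypothetical predicted free energy changes per substitution;
# more negative is treated here as more favorable.
ddg = {"I46V": -1.2, "A60S": -0.3, "L78D": 2.5, "T28A": 0.4}


def top_subset_score(values: Dict[str, float], keep: int,
                     score: float = 1.0) -> Dict[str, float]:
    """Step-function scoring: the `keep` most favorable substitutions share
    one score; all others are eliminated with a score of zero."""
    ranked = sorted(values, key=values.get)          # most negative first
    return {name: (score if i < keep else 0.0) for i, name in enumerate(ranked)}


def rank_score(values: Dict[str, float]) -> Dict[str, float]:
    """Rank-order scoring: best substitution gets n, worst gets 1."""
    ranked = sorted(values, key=values.get)
    n = len(ranked)
    return {name: float(n - i) for i, name in enumerate(ranked)}


def exp_transform(values: Dict[str, float]) -> Dict[str, float]:
    """Functional transform: exp(-value), so favorable changes score above 1."""
    return {name: math.exp(-v) for name, v in values.items()}


top = top_subset_score(ddg, keep=2)
ranks = rank_score(ddg)
transformed = exp_transform(ddg)
```

The step-function and rank-based scores discard the magnitude of the free energy difference, while the exponential transform keeps it, so the choice affects how strongly one rule can dominate a combined score.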

5.1.2 Variations in the Method of Combining Scores 
The scores produced by individual rules can be combined in a variety of ways. 
In some embodiments they are added together in the manner of the 
algorithm illustrated in Section 5.1 above. In some embodiments, the scores are 
multiplied together. For example, 

    for each residue position j of the antibody identified in step 01 
    { 
        for each possible substitution k of residue j 
        { 
            initialize score_jk; 
            for each method m (method 130) in a suite of methods 
            { 
                initialize score_m 
                for each filter n (rule 120) in method m 
                { 
                    compute filter n based on substitution k at position j; 
                    score_m = score_m x result of filter n; 
                } 
                score_jk = score_jk + score_m 
            } 
        } 
    } 
    rank all scores_jk 
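
A compact Python rendering of the multiplicative variant is given below as a sketch. One detail the listing leaves open is the starting value of the per-method score; the sketch assumes a starting value of 1, which is what math.prod uses, since a starting value of 0 would force every product to zero. The toy filters and their values are invented.

```python
import math
from typing import Callable, List

Filter = Callable[[int, str], float]


def method_product(j: int, k: str, filters: List[Filter]) -> float:
    """Multiply filter results within one method; math.prod starts from 1."""
    return math.prod(f(j, k) for f in filters)


def combined_score(j: int, k: str, methods: List[List[Filter]]) -> float:
    """Add the per-method products, as in the listing above."""
    return sum(method_product(j, k, m) for m in methods)


# Toy filters with graded outputs (values are illustrative).
burial = lambda j, k: 0.5 if k == "D" else 2.0
homolog_seen = lambda j, k: 3.0 if k == "V" else 0.0
matrix_ok = lambda j, k: 1.0

score_v = combined_score(46, "V", [[burial, homolog_seen], [matrix_ok]])
score_d = combined_score(46, "D", [[burial, homolog_seen], [matrix_ok]])
```

Note that with multiplication a single filter result of zero cancels the whole contribution of its method, which makes this variant stricter than simple addition.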

In some embodiments, one or more rules 120 can be used as a filter, so that 
only substitutions passing the one or more filters are used, regardless of their scores 
from the other rules. For example, 

    for each residue position j of the antibody identified in step 01 
    { 
        for each possible substitution k of residue j 
        { 
            initialize score_jk; 
            set abort false 
            for each method m (method 130) in a suite of methods 
            { 
                initialize score_m 
                for each filter n (rule 120) in method m 
                { 
                    compute filter n based on substitution k at position j; 
                    if result of filter n is negative 
                    { 
                        set abort true 
                        break; 
                    } 
                    score_m = score_m + result of filter n; 
                } 
                if abort 
                { 
                    set score_jk = 0 
                    break; 
                } 
                else 
                { 
                    score_jk = score_jk + score_m 
                    /* or score = f(score_jk) * r(score_m) */ 
                    /* scores can be functionally transformed or normalized */ 
                } 
            } 
        } 
    } 
    rank all scores_jk 
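
The abort logic of the filtering variant can be written more directly in Python with an early return. The sketch below assumes, as in the listing, that a negative filter result means the substitution has failed that filter; the filter names and the example of vetoing an asparagine replacement are invented purely to have something to test.

```python
from typing import Callable, List

Filter = Callable[[int, str], float]


def score_with_veto(j: int, k: str, methods: List[List[Filter]]) -> float:
    """Return the summed score, or 0 as soon as any filter result is negative."""
    total = 0.0
    for method in methods:
        score_m = 0.0
        for rule in method:
            result = rule(j, k)
            if result < 0:          # a failed filter vetoes the substitution
                return 0.0
            score_m += result
        total += score_m
    return total


# Toy filters: the first acts as a hard filter, the second only adds score.
no_new_asn = lambda j, k: -1.0 if k == "N" else 1.0
near_cdr = lambda j, k: 1.0 if j == 46 else 0.0

passed = score_with_veto(46, "V", [[no_new_asn], [near_cdr]])
vetoed = score_with_veto(46, "N", [[no_new_asn], [near_cdr]])
```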

In some embodiments, a cumulative score can be produced by any 
mathematical function of the scores produced by two or more individual rules. For 
example, 

    for each residue position j of the antibody identified in step 01 
    { 
        for each possible substitution k of residue j 
        { 
            initialize score_jk; 
            for each method m (method 130) in a suite of methods 
            { 
                initialize score_m 
                for each filter n (rule 120) in method m 
                { 
                    compute filter n based on substitution k at position j; 
                    score_m = score_m + weight_n x (result of filter n); 
                } 
                score_jk = score_jk + score_m 
            } 
        } 
    } 
    rank all scores_jk 

In the exemplary algorithm above, weight_n is some rule 120 specific weight 
that is independently assigned to a rule. Such weights can be stored in knowledge 
base 108 and adjusted by an expert using knowledge base editor 114 (Fig. 1). 
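
A weighted combination of this kind can be sketched in Python by looking up each rule's weight by name. The rule names, the default weight of 1.0 for unlisted rules, and the numeric weights are assumptions made for the example; in the system described, the weights would come from knowledge base 108.

```python
from typing import Callable, Dict, List, Tuple

Filter = Callable[[int, str], float]
# Each method is a list of (rule name, filter); weights are looked up by name.
Method = List[Tuple[str, Filter]]


def weighted_score(j: int, k: str, methods: List[Method],
                   weights: Dict[str, float]) -> float:
    """Per method: sum of weight_n * result_n; then summed over methods."""
    total = 0.0
    for method in methods:
        total += sum(weights.get(name, 1.0) * rule(j, k) for name, rule in method)
    return total


near_cdr = lambda j, k: 1.0 if j == 46 else 0.0
in_homologs = lambda j, k: 1.0 if k == "V" else 0.0

methods = [[("structure", near_cdr)], [("homology", in_homologs)]]
weights = {"structure": 4.0, "homology": 0.5}   # illustrative values only
score = weighted_score(46, "V", methods, weights)
```

Because the weights live in a plain mapping separate from the rules, they can be edited by hand or updated automatically without changing the scoring code.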

In addition, prior to combination, scores produced by individual rules can be 
scaled or normalized and/or transformed by a mathematical function to facilitate 
their combination. For example, in the case of the humanization of RSV-19, the mutation 
I46V was identified as the most favorable substitution in the framework region by 
combining the scores from methods 130. The distance, expressed in fraction of amino 
acid differences, was transformed and a Poisson correction (-log[1 - fraction]) applied 
and multiplied by the product of the absolute scores obtained from the other methods 
130. The resulting scores for all substitutions were ranked and I46V (combination 
score 126) was ranked 1. In this example different criteria were used to compute the 
scores for the framework and CDR regions. 

5.1.3 Variations in the Method of Assigning Weights to Scores From 
Different Rules 

As indicated in Section 5.1.2, the scores produced by individual rules 120 can 
be assigned different weights prior to being combined. For example, if the total score 
for a substituting monomer x at position i ($S_{ix}$) is obtained by adding the scores 
obtained by applying n different rules, the score can be expressed by Equations (1) or 
(2): 

(Eq. 1) $S_{ix} = W_1 R_1(ix) + W_2 R_2(ix) + W_3 R_3(ix) + W_4 R_4(ix) + W_5 R_5(ix) + \cdots + W_n R_n(ix)$ 
where, 

$R_n(ix)$ is the score given by rule n for substituting monomer x at position 
i; and 

$W_n$ is a weight applied to the score given by rule n. 

(Eq. 2) $S_{ix} = f(W_1 R_1(ix), W_2 R_2(ix), \ldots, W_j R_j(ix))$ 
where, 

$R_j(ix)$ is the score given by rule j for substituting monomer x at position 
i; 

$W_j$ is a weight applied to scores given by rule j; and 

f is some mathematical function. 

Rules (and weights) can be (i) specific for a substitution of monomer x at a 
specific location, (ii) specific for position for any and/or a group of monomer 
substitution(s), (iii) specific for any and/or a group of positions for a specific 
monomer x, (iv) specific for any substitutions derived from a particular and/or a group 
of homologs, or (v) specific for any position derived from a particular and/or a group 
of homologs. 

The use of weights to modify scores obtained using different rules has a
number of benefits.

Firstly, the use of weights to modify scores obtained using different rules 120
allows different rules 120 to have different degrees of influence over the final
score for a substitution. For example, if Rule 4 is the most important in
determining the suitability of a substitution in a particular antibody, then this
rule can be made to dominate the total score for the substitutions by making
$W_4$ much higher than the other weights.

Secondly, the use of weights to modify scores obtained using different rules
120 allows different rules 120 to have different degrees of influence over the
final score for a substitution depending upon the class or subclass of antibodies
being considered. For example, a rule 120 considering the structural effect of a
substitution can be most important for engineering an antibody, while a rule 120
considering the statistical likelihood of a substitution using a substitution
matrix can be most important for engineering a protease. In this case, by first
determining to which class of antibody the target antibody belongs, expert system
100 can then be used to assign weights to the scores from different rules 120
that will result in the most accurate assessment of the favorability of
substitutions. Moreover, as previously described, expert system 100 can assign
different weights to different methods, to produce more control over how
substitution scores are computed.

Thirdly, the use of weights to modify scores obtained using different rules 120
allows expert system 100 to incorporate information obtained from previous
experiments. For example, another aspect of the invention involves the use of
sequence-activity relationships to empirically measure the contribution of
substitutions to one or more activities of an antibody. This aspect of the
invention is described more fully in Section 5.5. This sequence-activity
determination effectively creates a feedback loop by which weights assigned to
the scores from different rules 120 applied by expert system 100 can be adjusted.
As an example, consider the case in which 20 substitutions within an antibody
(represented by S1-S20) receive final combined scores C1-C20 from expert system
100. A set of antibodies that contain these substitutions is synthesized, and
sequence-activity relationships are derived using wet lab assays. The
sequence-activity relationships are used to determine actual scores that measure
the fitness of each substitution for the desired activity of the antibody
(F1-F20). The weights applied to each rule 120 and/or method 130 can then be
adjusted so that the observed fitness of each substitution, F1-F20, correlates
more closely with scores C1-C20 produced by expert system 100. In some
embodiments, this correlation is the correlation between the absolute values of
the scores for each substitution from expert system 100 and the observed fitness
of each substitution derived from the sequence-activity relationship. In some
embodiments, the correlation can be a correlation between the rank order of
effect of substitutions predicted by expert system 100 and the rank order of
substitutions observed or derived from the sequence-activity relationship. The
weights applied to each rule 120 can also be adjusted so that the correlation
between the observed fitness of substitutions and the scores produced by expert
system 100 is maximized for more than one set of substitutions, in one or more
different target antibodies.
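
The feedback loop described above can be sketched as a search for the rule
weights whose combined scores correlate best with observed fitness. This is
only an illustration: the exhaustive grid search, the Pearson correlation and
all numeric values are assumptions, not a procedure prescribed by the text.

```python
from itertools import product
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def fit_weights(rule_scores, observed_fitness, grid):
    """Try every weight combination on the grid; keep the best-correlated one."""
    n_rules = len(rule_scores[0])
    best_weights, best_r = None, -2.0
    for weights in product(grid, repeat=n_rules):
        combined = [sum(w * r for w, r in zip(weights, row)) for row in rule_scores]
        r = pearson(combined, observed_fitness)
        if r > best_r:
            best_weights, best_r = weights, r
    return best_weights, best_r

# Hypothetical data: one row of rule scores per substitution, two rules
rule_scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0]]
# Hypothetical observed fitness values (constructed here as 3*rule1 + 1*rule2)
observed = [3.0, 1.0, 4.0, 6.5, 3.5]

best_w, best_r = fit_weights(rule_scores, observed, grid=[0.0, 1.0, 2.0, 3.0])
```

A rank-order variant would replace `pearson` with a correlation computed on the
ranks of the two lists.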

Different classes of antibodies can optionally be used to provide different sets
of substitutions for comparing observed fitness and scores produced by expert
system 100. This allows different weights to be calculated to apply to the scores
produced by different rules 120 as a function of antibody class. One skilled in
the art will appreciate that there are many possible variations of using
experimental results to adjust weights applied to rule 120 scores. All such
variants, whose predictive scoring functions can be adjusted based upon
experimental data, are within the scope of the expert systems 100 of the present
invention and can thus be considered systems capable of learning.

Because of the capacity for expert systems 100 of the present invention to
learn by, for example, adjustment of rule 120 weights, in some instances it can
be desirable to select substitutions that are favored strongly by different rules
120. Such selection can facilitate the establishment of the appropriate weights
to be applied to different rules 120 used by expert system 100.

The score for a substitution based on two or more rules can be calculated
independently or using conditional probabilities. An expert system 100 can
produce scores for at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90 or 100 positions in the
reference sequence up to the entire sequence, and can include contiguous residues
or noncontiguous residues or mixtures thereof. The expert system 100 can include
at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25,
30, 35, 40, 45 or 50 different residues. Naturally occurring residues can be
included in the expert system, as well as unnatural residues for synthetic
methods, and combinations thereof.

In another embodiment of the invention, the above calculations can be
performed by an expert with access to the relevant knowledge base 108, for
example, by using user interface 104.

Examples of the ways in which such expert system 100 can be used to
automatically select substitutions to make in an antibody will now be described
in the following sections with reference to Figs. 1 and 2. The following
exemplary process is intended to illustrate one possible embodiment of the
invention. One skilled in the art will recognize that there are many possible
variations on this theme, and the following is not intended to limit the present
invention. The selection process refers to the scheme shown in Fig. 3.

Fig. 3 shows a series of independent rules 120, each of which can be used to 
produce a score for any possible amino acid substitution in an antibody. In one 
embodiment of the invention, all possible single substitutions can be enumerated 
computationally and then scored according to one or more of the rules executed by 
expert system 100. 
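
As a minimal sketch of the enumeration step, assuming the 20 standard amino
acids and 1-based position numbering (the function names are illustrative and
do not come from the source):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes

def enumerate_single_substitutions(sequence):
    """Yield (position, wild_type, substitute) for every possible single substitution."""
    for position, wild_type in enumerate(sequence, start=1):
        for substitute in AMINO_ACIDS:
            if substitute != wild_type:
                yield (position, wild_type, substitute)

def label(substitution):
    """Format a substitution in the usual wild-type/position/substitute style, e.g. I46V."""
    position, wild_type, substitute = substitution
    return f"{wild_type}{position}{substitute}"

# A five-residue toy sequence gives 5 positions x 19 alternatives = 95 substitutions
substitutions = list(enumerate_single_substitutions("AGWRY"))
```

Each enumerated substitution can then be passed to one or more scoring rules.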

5.1.4 Rules Based on Substitutions from Related Antibody Sequences 



One source of information that can be used to construct rules 120 that assess
the likely effect of amino acid substitutions upon one or more activities of an
antibody is the sequence of one or more homologous or related antibodies. See,
for example, Fig. 3, rule 3a. Homologous sequences are generally analogous
functionally and structurally, although, having been subjected separately to
different selective pressures, they are also likely to be optimized differently.
Antibody sequence variants can also be generated in the lab using many
techniques, and the sequences and properties of several such antibodies are
available in databases and the literature. Amino acids that differ between
homologous sequences thus provide a guide to substitutions that are likely to
yield functional though different antibody sequences. For humanization of
antibodies, alignment of the target antibody with human germline sequences
available in the databases is used to identify residues in the human framework.
The sequences can be grouped into classes as defined by Chothia and Lesk
(Chothia C, Lesk AM, "Canonical structures for the hypervariable regions of
immunoglobulins." J Mol Biol. 1987 Aug 20;196(4):901-17). Alignment of
homologous sequences can therefore be used to identify candidate substitution
positions.

In one approach, homologous antibody sequences or sequence classes are aligned
(e.g., by using clustalw; Thompson et al., 1994, Nucleic Acids Res 22: 4673-80)
and then a phylogenetic tree is reconstructed. Conservation indices can then be
calculated for each site (e.g., Dopazo, 1997, Comput Appl Biosci 13: 313-7) and
the information content calculated for each site (e.g., Zhang, 2002, J Comput
Biol 9: 487-503). These scores can be exhaustively calculated for every position
in the antibody. The scores reflect the extent of tolerance to substitutions in
the antibody at each position. The scores can be normalized using the
phylogenetic tree to eliminate bias in the homolog sequences found in databases
(for example, ease of access to certain template DNAs results in sequences from
certain classes of organisms dominating the database). Scores for a given
alignment can also be normalized to have an average value of 0.0 and a standard
deviation of 1.0, or other standard procedures can be used to compare and
combine scores from multiple methods. These values can then be used directly as
a score, as outlined above and in Equation (1) or Equation (2). In some
embodiments, all sites with a score above a certain threshold value can be
selected. For example, a cutoff (threshold) of 0.0 can be chosen (which is set
to be the average score). In still other embodiments, all sites with a score
below a certain threshold value can be eliminated. In some embodiments, the most
variable (e.g., least conserved) sites can be selected by ranking the sites in
order of these scores. For example, the most highly scoring site can be
selected, or the 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 50, 60, 70, 80, 90 or 100 most highly scoring sites can be selected. In some
embodiments, the least variable (e.g., most conserved) sites can be eliminated
by ranking the sites in order of these scores. For example, the least highly
scoring site can be eliminated, or the 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260,
270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 500, 600,
700, 800, 900 or 1000 least highly scoring sites can be eliminated (Fig. 3,
Rule 1a).
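
The normalization and threshold selection can be sketched as follows. Shannon
entropy per alignment column is used here only as an assumed stand-in for the
cited conservation and information-content measures, the toy alignment is
invented, and no phylogenetic correction is applied.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; higher means more variable."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def site_scores(alignment):
    """One variability score per aligned position."""
    return [column_entropy(column) for column in zip(*alignment)]

def z_normalize(scores):
    """Rescale scores to an average of 0.0 and a standard deviation of 1.0."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]

# Hypothetical alignment of four equal-length homologous sequences
alignment = ["AGWRY", "AGWKY", "ASWRF", "AGWHY"]
z_scores = z_normalize(site_scores(alignment))

# Select all sites scoring above the cutoff of 0.0 (the average), 1-based positions
selected_sites = [i + 1 for i, z in enumerate(z_scores) if z > 0.0]
# Rank sites from most to least variable for top-N selection
ranked_sites = sorted(range(1, len(z_scores) + 1), key=lambda i: z_scores[i - 1], reverse=True)
```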

Amino acid diversity and tolerance at each site can be measured as a fitness
property of each amino acid at every location. In this approach, all available
related antibody sequences can be considered. The most fit residue for that
position carries a higher value (e.g., Koshi et al., 2001, Pac Symp Biocomput
191-202; O. Soyer, M.W. Dimmic, R.R. Neubig, and R.A. Goldstein; Pacific
Symposium on Biocomputing 7:625-636 (2002)). Sites can be grouped into
site-classes or treated independently. Sites and site classes most fit to change
based on the substitution rate and the substitutions most favorable based on the
fitness can be selected (Fig. 3, Rule 2a). In some embodiments, these values of
fitness can then be used directly as a score, as outlined above and in Equation
(1) or Equation (2). In some embodiments, all sites with a score above a certain
threshold value can be selected. For example, a cutoff (threshold) of 0.0 can be
chosen (when the normalization of scores sets the wild type residue found in the
reference to be 0.0). In some embodiments, all sites with a score below a
certain threshold value can be eliminated. Scores of 0.0 or below can be
eliminated, thereby only including amino acid changes that have a higher fitness
value than the reference wild type amino acid found in that position. In some
embodiments, the sites most tolerant to change can be selected by ranking the
sites in order of these scores. For example, the most highly scoring site can be
selected, or the 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 50, 60, 70, 80, 90 or 100 most highly scoring sites may be selected. In some
embodiments, the sites least tolerant to change can be eliminated by ranking the
sites in order of these scores. For example, the least highly scoring site can
be eliminated, or the 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130,
140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290,
300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 500, 600, 700, 800, 900
or 1000 least highly scoring sites can be eliminated.

For example, in the study of G-protein coupled receptors (GPCR) by Soyer et al.
(O. Soyer, M.W. Dimmic, R.R. Neubig, and R.A. Goldstein; Pacific Symposium on
Biocomputing 7:625-636 (2002)), using the 8-site class model, class #8 was
identified to have the highest substitution rate and the property correlating
with fitness of amino acids at these positions was identified to be the "charge
transfer" propensity of the amino acid. In the present invention, amino acids in
the sites that carry a higher relative fitness compared to the wild type amino
acid found in that position are identified as suitable for substitution. The
scores for these residues will be higher and can be combined with other methods
130.

Scores can also be assigned to residues from related sequences that are
classified into the same canonical class as the target antibody. In this
approach, substitutions that are derived from sequences that are part of the
same Chothia-Lesk canonical class can be scored (Fig. 3, Rule 3a).

5.1.5 Rules Based on Substitutions From Related Antibody Structures 

The structures of many antibodies and their variants are also available in the
RCSB protein data bank ((2002) Acta Cryst. D 58 (6:1), pp. 899-907); and
Structural Bioinformatics (2003); P. E. Bourne and H. Weissig, Hoboken, NJ, John
Wiley & Sons, Inc. pp. 181-198. The availability of structures can help identify
amino acid changes that affect protein function. One way in which they can be
used to do so is to avoid changes to the antibody of interest that will not be
structurally tolerated by the antibody. Changes computed in silico using energy
functions and force fields correlate with experimentally measured free energy
changes in the stabilities of proteins. See, for example, Privalov et al., 1988,
Adv Protein Chem 39: 191-234; Lee, 1993, Protein Sci 2: 733-8; Freire, 2001,
Methods Mol Biol 168: 37-68; and Guerois et al., 2002, J Mol Biol 320: 369-87.
Therefore, candidate amino acid changes can be modeled into the structure(s)
computationally and changes in the free energy computed. These computationally
calculated changes in free energies resulting from the substitutions can then be
used directly as a score, as outlined above and in Equation (1) or Equation (2).
Alternatively, all changes can be selected that increase the free energy of the
antibody by less than a certain value. For example, all changes that would
increase the free energy by less than 1 kcal/mol can be selected, all changes
that would increase the free energy by less than 1.5 kcal/mol can be selected,
all changes that would increase the free energy by less than 2 kcal/mol can be
selected, or all changes that would increase the free energy by less than
2.5 kcal/mol can be selected. In some embodiments, all changes can be eliminated
that increase the free energy of the antibody by more than a certain value. For
example, all changes that would increase the free energy by more than 1 kcal/mol
can be eliminated, all changes that would increase the free energy by more than
1.5 kcal/mol can be eliminated, all changes that would increase the free energy
by more than 2 kcal/mol can be eliminated, or all changes that would increase
the free energy by more than 2.5 kcal/mol can be eliminated. In some
embodiments, the best tolerated substitutions can be selected by ranking the
sites in order of the predicted increase in free energy. For example, the
substitution with the lowest increase in free energy can be selected, or the 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 50, 60, 70, 80,
90 or 100 substitutions with the lowest increase in free energy may be selected.
In some embodiments, the substitutions with the greatest increases in free
energy can be eliminated by ranking the sites in order of these scores. For
example, the 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310,
320, 330, 340, 350, 360, 370, 380, 390, 400, 500, 600, 700, 800, 900, 1000,
2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 12000, 14000, 16000,
18000 or 20000 substitutions with the greatest increases in free energy can be
eliminated (Fig. 3, Rule 1b).

In alternative embodiments, multiple changes can be modeled into the
structure(s) computationally and changes in the free energies resulting from the
substitutions computed. These free energy values can be used to identify changes
that are "valid" independently, but not together. Amino acid changes that are
independent can be selected preferentially. Amino acid clashes that yield a
higher free energy when compared to the free energies produced by modeling
changes separately can be eliminated.
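
Both uses of the computed free energies can be sketched briefly. The
substitution names, the free energy changes and the tolerance used to decide
whether two changes are independent are all hypothetical; a real workflow would
obtain the values from a structure modeling package.

```python
def select_tolerated(ddg_by_substitution, max_increase=1.0):
    """Keep substitutions whose predicted free energy increase is below the cutoff (kcal/mol)."""
    return sorted(name for name, ddg in ddg_by_substitution.items() if ddg < max_increase)

def rank_by_free_energy(ddg_by_substitution):
    """Order substitutions from lowest to highest predicted free energy increase."""
    return sorted(ddg_by_substitution, key=ddg_by_substitution.get)

def is_independent(ddg_a, ddg_b, ddg_pair, tolerance=0.5):
    """Treat two changes as independent when modeling them together costs no more
    than the sum of the separately modeled changes plus an assumed tolerance."""
    return ddg_pair - (ddg_a + ddg_b) <= tolerance

# Hypothetical predicted free energy changes in kcal/mol
ddg = {"A10G": 0.4, "L45P": 3.2, "S30T": 0.9, "V12I": 1.0}

tolerated = select_tolerated(ddg, max_increase=1.0)
ranking = rank_by_free_energy(ddg)
compatible = is_independent(ddg["A10G"], ddg["S30T"], ddg_pair=1.4)
clashing = not is_independent(ddg["A10G"], ddg["S30T"], ddg_pair=3.0)
```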

Regions of the antibody that differ structurally between antibodies are more
likely to tolerate change, while those regions that are structurally conserved
are likely to be less tolerant. Structures can be directly obtained from the
database or predicted using various structure modeling software packages.
Structures of homologs and mutants can be superposed on the wild type structure.
See, for example, May et al., 1994, Protein Eng 7: 475-85; and Ochagavia et al.,
2002, Bioinformatics 18: 637-40. Structural conservation can be calculated as
the root mean squared (RMS) deviations of the backbones of the superposed
chains. This can be computed as the deviations of individual residues, or more
preferably as the deviations of a running average over a stretch of between two
and ten residues of the backbone between the target antibody and one or more
homologous antibodies. These computationally calculated RMS deviations for every
position between homologous structures can then be used directly as a score, as
outlined above and in Equation (1) or Equation (2). In some embodiments, RMS
deviations between the alpha carbons (or backbone atoms) in the structure of the
target antibody and one or more homologous or related antibodies that are
greater than a threshold value can be considered structurally labile and these
sites can be selected. This threshold RMS deviation between homologous
structures can be greater than 2 Å, 2.5 Å, 3 Å, 3.5 Å, 4 Å, 4.5 Å or 5 Å.

In some embodiments, RMS deviations between the alpha carbons in the structure
of the target antibody and one or more homologous or related antibodies that are
less than a threshold value can be considered structurally conserved and these
sites can be eliminated. This threshold RMS deviation between homologous
structures can be less than 2 Å, 2.5 Å, 3 Å, 3.5 Å, 4 Å, 4.5 Å or 5 Å.
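
A minimal sketch of the per-residue and running-window RMS deviation follows.
It assumes the two structures have already been superposed and that their alpha
carbon coordinates are listed residue by residue in the same order; the
coordinates are invented and the window handling at the chain ends is one
possible choice.

```python
import math

def per_residue_deviation(coords_a, coords_b):
    """Distance in angstroms between matched alpha carbons of two superposed chains."""
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]

def windowed_rmsd(deviations, window=3):
    """RMS deviation over a running window centred on each residue (truncated at the ends)."""
    half = window // 2
    result = []
    for i in range(len(deviations)):
        chunk = deviations[max(0, i - half): i + half + 1]
        result.append(math.sqrt(sum(d * d for d in chunk) / len(chunk)))
    return result

# Hypothetical alpha carbon coordinates for a target and one homolog
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
homolog = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0), (3.0, 4.0, 0.0)]

deviations = per_residue_deviation(target, homolog)
smoothed = windowed_rmsd(deviations, window=3)

# Sites whose windowed RMS deviation exceeds 2 angstroms are treated as structurally labile
labile_sites = [i + 1 for i, value in enumerate(smoothed) if value > 2.0]
```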

In some embodiments, sites can be ranked in order of the calculated RMS
deviations between the alpha carbons in the structure of the target antibody and
one or more homologous or related antibodies and those with the highest
calculated RMS deviations selected. For example, the site with the highest
calculated RMS deviation between homologous structures can be selected, or the
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 50, 60, 70,
80, 90 or 100 sites with the highest calculated RMS deviations between
homologous structures may be selected.

In some embodiments, sites can be ranked in order of the calculated RMS
deviations between the alpha carbons in the structure of the target antibody and
one or more homologous or related antibodies and those with the lowest
calculated RMS deviations eliminated. For example, the site with the lowest
calculated RMS deviation between homologous structures can be eliminated, or the
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 50, 60, 70,
80, 90 or 100 sites with the lowest calculated RMS deviations between homologous
structures can be eliminated (Fig. 3, Rule 2b).

Changes near binding sites (and CDRs) are highly likely to influence the
activity of the antibody and are good candidates for substitution. All amino
acid substitutions that are found in one or more variants can be tested for
proximity to a binding or regulatory site of the antibody. In some embodiments,
the distance of an amino acid substitution that is found in one or more homologs
from a binding or catalytic or regulatory site can be used directly as a score,
as outlined above and in Equation (1) or Equation (2). Alternatively, in some
embodiments, all amino acid substitutions that are found in one or more homologs
and that are within a threshold distance of a binding or regulatory site in the
antibody can be selected. This threshold distance can be less than 2 Å, 2.5 Å,
3 Å, 3.5 Å, 4 Å, 4.5 Å, 5 Å, 5.5 Å, 6 Å, 6.5 Å or 7 Å. In still other
embodiments, all amino acid substitutions that are found in one or more homologs
and that are beyond a threshold distance of a binding or regulatory site in the
antibody can be eliminated. This threshold distance can be more than 2 Å, 2.5 Å,
3 Å, 3.5 Å, 4 Å, 4.5 Å, 5 Å, 5.5 Å, 6 Å, 6.5 Å or 7 Å. In still other
alternative embodiments, all amino acid substitutions that are found in one or
more homologs can be ranked in order of proximity to a binding or regulatory
site in the protein and those that are closest to the binding or regulatory site
selected by a rule 120. For example, the substitution closest to the binding or
catalytic or regulatory site can be selected, or between 2 and 20, between 10
and 100, or the top 200 substitutions closest to the binding or catalytic or
regulatory site can be selected. In still other alternative embodiments, all
amino acid substitutions that are found in one or more homologs can be ranked in
order of proximity to a binding or regulatory site in the antibody and those
that are farthest from the binding or catalytic or regulatory site eliminated.
For example, the substitution farthest from the binding or regulatory site can
be eliminated. In some embodiments, between 2 and 20, between 10 and 100, or the
top 200 substitutions farthest from the binding or regulatory site can be
eliminated.
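
The proximity rule can be illustrated in the same spirit. The coordinates,
substitution names and the choice of measuring to the nearest binding-site atom
are assumptions made for this sketch.

```python
import math

def distance_to_site(residue_xyz, site_atoms):
    """Distance in angstroms from a residue to the nearest atom of the binding site."""
    return min(math.dist(residue_xyz, atom) for atom in site_atoms)

def within_threshold(substitution_coords, site_atoms, threshold):
    """Substitutions lying closer to the site than the threshold distance."""
    return sorted(
        name for name, xyz in substitution_coords.items()
        if distance_to_site(xyz, site_atoms) < threshold
    )

def rank_by_proximity(substitution_coords, site_atoms):
    """Substitutions ordered from closest to farthest from the site."""
    return sorted(
        substitution_coords,
        key=lambda name: distance_to_site(substitution_coords[name], site_atoms),
    )

# Hypothetical coordinates for three candidate substitutions and a two-atom site
substitution_coords = {
    "sub_far": (10.0, 0.0, 0.0),
    "sub_mid": (4.0, 0.0, 0.0),
    "sub_near": (1.0, 0.0, 0.0),
}
site_atoms = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

close = within_threshold(substitution_coords, site_atoms, threshold=5.0)
order = rank_by_proximity(substitution_coords, site_atoms)
```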

5.1.6 Rules Based on Substitutions from Substitution Matrices 

Another source of information that can be used to construct rules 120 that
assess the likely effect of amino acid substitutions upon one or more activities
of an antibody is the frequency with which one amino acid is observed to
substitute for another amino acid in different proteins. The matrix can be
expressed in terms of probabilities or values derived from probabilities by
mathematical transformation involving probabilities of transitions or
substitutions (Pij) and observed frequencies of amino acids (Fi). Matrices using
such transformation include scoring matrices like PAM100, PAM250, BLOSUM, etc.
See, for example, Fig. 3, rule 1c. Substitution matrices are derived from
pairwise alignments of protein homologs from sequence databases. They constitute
estimates of the probability that one amino acid will be changed to another
while conserving function. Different substitution matrices are calculated from
different sets of sequences. For example, they can be based on the structural
environment of a residue (Overington, 1992, Genet Eng (NY) 14: 231-49; and
Overington et al., 1992, Protein Sci 1: 216-26) or on additional factors
including secondary structure, solvent accessibility, and residue chemistry
(Luthy et al., 1992, Nature 356: 83-5). Substitution matrices can be derived for
specific sites or groups of sites in the antibody. Specifically, substitutions
specific for antibody framework regions and antibody CDR regions can be
generated using the sequences in the database. Additionally, substitutions can
be derived based on the amino acid frequencies compiled for every CDR position
for every antibody class in the Kabat database.
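
One common transformation of this kind is a log-odds score. The sketch below
assumes a PAM-style form, scale x log10(Pij / Fj), with invented probabilities
and background frequencies; it is not taken from any published matrix.

```python
import math

def log_odds_matrix(substitution_probs, background_freqs, scale=10.0):
    """Score each (i, j) pair as scale * log10(Pij / Fj).

    Pij is the probability of residue i being replaced by j and Fj is the
    background frequency of j. Positive scores mean the substitution is seen
    more often than expected by chance.
    """
    return {
        (i, j): scale * math.log10(p_ij / background_freqs[j])
        for (i, j), p_ij in substitution_probs.items()
    }

def select_above(scores, threshold):
    """Non-identity substitutions whose score exceeds the threshold."""
    return sorted(pair for pair, s in scores.items() if pair[0] != pair[1] and s > threshold)

# Hypothetical probabilities for isoleucine over a three-letter toy alphabet
substitution_probs = {("I", "I"): 0.50, ("I", "V"): 0.45, ("I", "D"): 0.05}
background_freqs = {"I": 0.3, "V": 0.3, "D": 0.4}

scores = log_odds_matrix(substitution_probs, background_freqs)
favorable = select_above(scores, threshold=0.0)
```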

A substitution matrix that best captures the observed sequences in the antibody
family of interest can be calculated using the Bayesian method developed by
Goldstein et al. (Koshi et al., 1995, Protein Eng 8: 641-645) and used to score
all candidate substitutions.

In some embodiments, these values can then be used directly as a score, as
outlined above and in Equation (1) or Equation (2). The scores can be expressed
as Pij: the probability of substituting residue i with j. Any transformations of
Pij can also be used. Pij can be computed for a specified evolutionary distance.
In alternative embodiments, all substitutions with a probability above a certain
threshold value may be selected. Threshold values of 0.00001, 0.00001, 0.0001,
0.01 or 0.1 can be used for probabilities and/or threshold values of -5, -4, -3,
-2, -1, 0, 1, 2, 3, 4 or 5 for any PAM matrix. In still other embodiments, all
substitutions with a probability below a certain threshold value may be
eliminated. Threshold values of 0.00001, 0.00001, 0.0001, 0.01 or 0.1 can be
used for probabilities and/or threshold values of -5, -4, -3, -2, -1, 0, 1, 2,
3, 4 or 5 for any PAM matrix. In still other embodiments, the most favorable
substitutions can be selected by ranking substitutions in order of their
substitution matrix probability scores. For example, the most highly scoring
substitution can be selected, or the top 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, up to 50, up to 60, up to 70, up to 80, up to 90, up
to 100, up to 110, up to 120, up to 130, up to 140, up to 150, up to 160, up to
170, up to 180, up to 190, up to 200, up to 210, up to 220, up to 230, up to
240, up to 250, up to 260, up to 270, up to 280, up to 290, up to 300, up to
310, up to 320, up to 330, up to 340, up to 350, up to 360, up to 370, up to
380, up to 390, up to 400, up to 500, up to 600, up to 700, up to 800, up to
900, up to 1000, up to 2000, up to 3000, up to 4000, up to 5000, up to 6000, up
to 7000, up to 8000, up to 9000, up to 10000, up to 12000, up to 14000, up to
16000, up to 18000 or up to 20000 most highly scoring substitutions can be
selected. In still other embodiments, the least favorable substitutions can be
eliminated by ranking substitutions in order of their substitution matrix
probability scores. For example, the substitution with the lowest substitution
matrix probability may be eliminated, or the 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, up to 50, up to 60, up to 70, up to 80, up to
90, up to 100, up to 110, up to 120, up to 130, up to 140, up to 150, up to 160,
up to 170, up to 180, up to 190, up to 200, up to 210, up to 220, up to 230, up
to 240, up to 250, up to 260, up to 270, up to 280, up to 290, up to 300, up to
310, up to 320, up to 330, up to 340, up to 350, up to 360, up to 370, up to
380, up to 390, up to 400, up to 500, up to 600, up to 700, up to 800, up to
900, up to 1000, up to 2000, up to 3000, up to 4000, up to 5000, up to 6000, up
to 7000, up to 8000, up to 9000, up to 10000, up to 12000, up to 14000, up to
16000, up to 18000 or up to 20000 substitutions with the lowest substitution
matrix probability can be eliminated.

A substitution or scoring matrix calculated by considering homologous and/or
related antibodies from many different antibody classes (e.g., Benner et al.,
1994, Protein Eng 7: 1323-1332; and Tomii et al., 1996, Protein Eng 9: 27-36)
can be used to score all candidate substitutions. In some embodiments, these
values can then be used directly as a score, as outlined above and in Equation
(1) or Equation (2). In some embodiments, all substitutions with a probability
above a certain threshold value can be selected. Threshold values of 0.00001,
0.00001, 0.0001, 0.01 or 0.1 can be used for probabilities and/or threshold
values of -5, -4, -3, -2, -1, 0, 1, 2, 3, 4 or 5 can be used for any PAM matrix.
In still other embodiments, all substitutions with a probability below a certain
threshold value can be eliminated. Threshold values of 0.00001, 0.00001, 0.0001,
0.01 or 0.1 can be used for probabilities and/or threshold values of -5, -4, -3,
-2, -1, 0, 1, 2, 3, 4 or 5 can be used for any PAM matrix. In still other
embodiments, the most favorable substitutions can be selected by ranking
substitutions in order of their substitution matrix probability scores. For
example, the most highly scoring substitution may be selected, or the 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, up to 50, up to 60,
up to 70, up to 80, up to 90, up to 100, up to 110, up to 120, up to 130, up to
140, up to 150, up to 160, up to 170, up to 180, up to 190, up to 200, up to
210, up to 220, up to 230, up to 240, up to 250, up to 260, up to 270, up to
280, up to 290, up to 300, up to 310, up to 320, up to 330, up to 340, up to
350, up to 360, up to 370, up to 380, up to 390, up to 400, up to 500, up to
600, up to 700, up to 800, up to 900, up to 1000, up to 2000, up to 3000, up to
4000, up to 5000, up to 6000, up to 7000, up to 8000, up to 9000, up to 10000,
up to 12000, up to 14000, up to 16000, up to 18000 or up to 20000 most highly
scoring substitutions can be selected. In still other embodiments, the least
favorable substitutions can be eliminated by ranking substitutions in order of
their substitution matrix probability scores. For example, the substitution with
the lowest substitution matrix probability may be eliminated, or the 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, up to 50, up to 60, up
to 70, up to 80, up to 90, up to 100, up to 110, up to 120, up to 130, up to
140, up to 150, up to 160, up to 170, up to 180, up to 190, up to 200, up to
210, up to 220, up to 230, up to 240, up to 250, up to 260, up to 270, up to
280, up to 290, up to 300, up to 310, up to 320, up to 330, up to 340, up to
350, up to 360, up to 370, up to 380, up to 390, up to 400, up to 500, up to
600, up to 700, up to 800, up to 900, up to 1000, up to 2000, up to 3000, up to
4000, up to 5000, up to 6000, up to 7000, up to 8000, up to 9000, up to 10000,
up to 12000, up to 14000, up to 16000, up to 18000 or up to 20000 substitutions
with the lowest substitution matrix probability can be eliminated.

5.1.7 Rules Based on Substitutions from Principal Component Analysis of 
Sequence Alignments 

Antibody sequences can be mathematically represented in terms of many
variables, each variable representing the type of amino acid at a specific location. For
example, the sequence AGWRY can be represented by 5 variables, where variable 1
assumes a value of "A" corresponding to position 1, variable 2 is "G" corresponding
to position 2, and so on. Each variable can assume 1 of 20 possibilities. Alternatively,
each variable can also represent multiple positions (say 2) and assume 1 of 400 values
(for 2 positions) corresponding to the 20 x 20 = 400 combinations of possible amino acid
pairs. Alternatively, each position can assume a value corresponding to a
physico-chemical property of the amino acid instead of amino acid identity. Alternatively,
each variable can be a combination of variables representing properties of amino
acids. Alternatively, each variable can be represented in a binary form, corresponding
to presence or absence of a particular amino acid. Alternatively, each variable can be
represented in a binary form corresponding to presence or absence of a defined group
of amino acids.
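
The alternative representations listed above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of adjacent positions for the two-position variables, and the aromatic group are assumptions made for the example and are not taken from the specification.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def identity_variables(sequence):
    """One variable per position, each taking 1 of 20 values."""
    return list(sequence)


def pair_variables(sequence):
    """One variable per adjacent position pair, each taking 1 of 400 values.

    Pairing adjacent positions is an assumption made for this sketch.
    """
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]


def binary_variables(sequence):
    """Presence (1) or absence (0) of each amino acid at each position."""
    return [1 if residue == aa else 0 for residue in sequence for aa in AMINO_ACIDS]


def group_variables(sequence, group):
    """Presence (1) or absence (0) of a defined group of amino acids per position."""
    return [1 if residue in group else 0 for residue in sequence]


AROMATIC = frozenset("FWY")  # illustrative group, chosen only for this example

identity = identity_variables("AGWRY")
pairs = pair_variables("AGWRY")
bits = binary_variables("AGWRY")
aromatic = group_variables("AGWRY", AROMATIC)

print(identity)
print(pairs)
print(sum(bits), "of", len(bits), "binary variables set")
print(aromatic)
```

The binary form produces 20 variables per position, which is why even a short sequence leads to a high-dimensional variable space.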

Typical antibodies contain many hundreds of variables. A set of antibodies occupies
various points in the variable space, and relationships between various antibodies can
be represented in terms of the values of the variables corresponding to those
antibodies. In such a high-dimensional space (due to the large number of variables),
antibodies can be clustered and classified using statistical techniques such as principal
component analysis, k-means clustering, support vector machines (SVM), etc.

Using such methods, particularly, but not limited to, Principal Component
Analysis (PCA), we can classify sequences and identify residues that differentiate
various related antibody sequences and their functions. Typical antibody sequence
alignments contain many amino acid positions at which differences occur, leading to a
high number of dimensions required to represent the sequence space. A sequence
alignment can be subjected to principal component analysis to identify new composite
dimensions that describe and visualize a significant fraction of the variation between a
set of sequences. The new dimensions (the principal components) can also be
described in terms of the contributions of each monomer variation within the original
sequence alignment to that dimension (the "loads"). Typically a single principal
component contains contributions from tens or hundreds of different monomer
differences within a set of antibody sequences. One powerful application of principal
component analysis is that it can be used to suggest a relationship between antibody
sequence and function. An antibody sequence can be represented in terms of the
principal components of that sequence. Principal components can then be identified
in which antibodies are grouped functionally. The loads of those principal
components can then be used to identify the monomers that are most responsible for
the grouping of the antibodies within sequence space. These monomers are thus good
candidates for substitutions likely to affect function.
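
As a minimal sketch of how loads can point to differentiating residues, the following Python code computes the first principal component of a binary-encoded toy alignment by power iteration. The four five-residue sequences, the function names and the choice of power iteration are assumptions made for illustration only; a real alignment would be far larger and would normally be analyzed with a numerical library.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Binary presence/absence variables, 20 per aligned position."""
    return [1.0 if residue == aa else 0.0 for residue in sequence for aa in AMINO_ACIDS]


def centre_columns(rows):
    """Subtract the column mean from every variable."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Return (loads, scores) of the first principal component.

    Power iteration on X^T X of the centred data; adequate for a toy example.
    """
    x = centre_columns(rows)
    d = len(x[0])
    loads = [float(j + 1) for j in range(d)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, loads)) for row in x]
        loads = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in loads))
        loads = [v / norm for v in loads]
    scores = [sum(a * b for a, b in zip(row, loads)) for row in x]
    return loads, scores


def label(column):
    """Name a binary variable as residue plus 1-based position, e.g. 'W3'."""
    position, aa_index = divmod(column, len(AMINO_ACIDS))
    return f"{AMINO_ACIDS[aa_index]}{position + 1}"


# Hypothetical alignment: positions 1 and 3 vary together, position 4 varies alone.
alignment = ["AGWRY", "AGWKY", "SGFRY", "SGFKY"]
loads, scores = first_principal_component([one_hot(s) for s in alignment])
top = sorted(range(len(loads)), key=lambda j: abs(loads[j]), reverse=True)[:4]

print([(label(j), round(loads[j], 3)) for j in top])
print([round(s, 3) for s in scores])
```

In this toy case the sequences separate into two groups along the first component, and the variables with the largest absolute loads are the residues at positions 1 and 3, which are exactly the ones that distinguish the two groups.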

Thus for proteins, amino acid substitutions that are most important in
differentiating and grouping sequences are often also those that functionally
differentiate the proteins. Identification of such amino acids using
dimension-reducing techniques such as principal component analysis has been described (e.g.,
Casari et al., 1995, Nat Struct Biol 2: 171-178; Gogos et al., 2000, Proteins 40: 98-105;
and del Sol Mesa et al., 2003, J Mol Biol 326: 1289-1302). PCA can identify
sequence features and substitutions corresponding to the desired phenotype of the
protein, and the scores ("loads") for these features in the direction of the desired
phenotype can be used as absolute scores or as filters to identify substitutions.

An example of the use of principal component analysis for identification of
favorable substitutions is also shown in Figs. 8-12. Fig. 8 shows the accession numbers
of 49 proteases whose sequences are homologous to proteinase K. A
property of interest in this example is activity during or after exposure of the protein
to heat. The 49 sequences were subjected to principal component analysis, and the
distribution of the sequences in the first two principal components is shown in Fig. 9.
Proteases 46, 47, 48 and 49 were all obtained from thermostable organisms and can
thus be expected to possess desirable thermostability properties. As shown in Fig. 9,
these four proteases are grouped together in the first two principal components of the
sequence space, characterized by strongly negative scores in both principal
components 1 and 2. Fig. 10 shows the contributions (the "loads") of all amino acid
differences within the alignment of the 49 proteases to the new dimensions, principal
components 1 and 2. Fig. 11 shows an expanded detail of the lower left corner of Fig.
10 in which the identities of each amino acid contributing to the principal components
are now shown. These amino acids are those most responsible for giving a protein
sequence a strong negative score in principal component 1 and principal component 2.
These contributions are quantitated in Fig. 12. Because these scores are also those
seen for proteases from thermophilic organisms, the amino acids that are primarily
responsible for conferring these scores upon proteins are very good candidates for
amino acids that may confer desirable properties, in this case thermostability.

An example of the use of principal component analysis related to antibody
humanization for identification of favorable substitutions is also shown in Figs. 24-28.
Fig. 24 shows the sequence identification numbers, listed by locus, of germline
sequences from VBase (available at http://www.mrc-cpe.cam.ac.uk/). The properties of
interest in this example are the characteristics of sequence 4-28. The heavy chains of
the sequences listed in Fig. 24, along with the heavy chain of the murine antibody RSV19,
were subjected to principal component analysis, and the distribution of the sequences
in the first two principal components is shown in Fig. 25. Sequence 4-28 is in the
sequence cluster containing sequences close in locus id. As shown in Fig. 25, these
are grouped together in the first principal component of the sequence space,
characterized by strongly positive scores in principal component 1. Fig. 26 shows the
contributions (the "loads") of all amino acid differences within the alignment to the
new dimensions, principal components 1 and 2. Fig. 27 shows an expanded detail of
the right center of Fig. 26 in which the identities of each amino acid contributing to
the principal components are now shown. These amino acids are those most
responsible for giving a protein sequence a strong positive score in principal
component 1. Some of these contributions are quantitated in Fig. 28. The amino
acids that are primarily responsible for conferring these scores upon proteins serve
as candidates for amino acids that may confer desirable properties, in this case
characteristics of germline sequence 4-28.

An example of the use of principal component analysis related to antibody
maturation for identification of favorable substitutions is also shown in Figs. 29-33.
Fig. 29 shows the sequence identification numbers, listed by locus, of germline
sequences from VBase (available at http://www.mrc-cpe.cam.ac.uk/). The properties of
interest in this example are the characteristics of sequence 5-a. The heavy chains of
germline sequences, along with the heavy chain of AAF21612, were subjected to
principal component analysis, and the distribution of the sequences in the first two
principal components is shown in Fig. 30. Sequence 5-a is in the sequence cluster
containing sequences close to locus id 1-x. As shown in Fig. 30, these sequences are
grouped together in the second principal component of the sequence space,
characterized by strongly negative scores in principal component 2. Fig. 31 shows the
contributions (the "loads") of all amino acid differences within the alignment to the
new dimensions, principal components 1 and 2. Fig. 32 shows an expanded detail of
the lower center of Fig. 31 in which the identities of each amino acid contributing to
the principal components are now shown. These amino acids are those most
responsible for giving a protein sequence a strong negative score in principal
component 2. Some of these contributions are quantitated in Fig. 33. The amino
acids that are primarily responsible for conferring these scores upon proteins are very
good candidates for amino acids that may confer desirable properties, in this case
characteristics of germline sequence 5-a.

Any sequence principal component can be used that contributes to
differentiating between two sets of antibodies and that is likely to reflect some
functional differences of interest. In some embodiments, the "load" contributed by a
substitution to one or more such principal components of sequence can be used directly
as a score, as outlined above and in Equation (1) or Equation (2). In some
embodiments, all substitutions with a "load" above a certain threshold value can be
selected. Threshold values can be determined from the distribution of load values.
For example, select the top ten percent of positive loads in principal component 1. In some
embodiments, all substitutions with a "load" below a certain threshold value can be
eliminated. For example, eliminate the top ten percent of the negative loads in
principal component 1. In still other embodiments, the substitutions with the highest
loads can be selected by ranking substitutions in order of their loads. For example,
the substitution with the highest "load" can be selected, or the 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 50, 60, 70, 80, 90 or 100 substitutions with the highest
"loads" can be selected. In still other embodiments, the substitutions with the lowest
loads can be eliminated by ranking substitutions in order of their loads. For example,
the substitution with the lowest "load" can be eliminated, or the 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, up to 50, up to 60, up to 70, up to 80, up to 90, up to
100, up to 110, up to 120, up to 130, up to 140, up to 150, up to 160, up to 170, up to
180, up to 190, up to 200, up to 210, up to 220, up to 230, up to 240, up to 250, up to
260, up to 270, up to 280, up to 290, up to 300, up to 310, up to 320, up to 330, up to
340, up to 350, up to 360, up to 370, up to 380, up to 390, up to 400, up to 500, up to
600, up to 700, up to 800, up to 900, up to 1000, up to 2000, up to 3000, up to 4000,
up to 5000, up to 6000, up to 7000, up to 8000, up to 9000, up to 10000, up to 12000,
up to 14000, up to 16000, up to 18000 or up to 20000 substitutions with the lowest
"loads" may be eliminated.

5.1.8 Other Exemplary Rules Based Upon Principal Component Analysis 
of Sequence Alignments 

All of the scores obtained as described in subsections 5.1.4 through 5.1.7 are
just examples of ways in which such values can be calculated. These values can then
be combined in one of the ways described in Section 5.1. One skilled in the art will
readily appreciate that there are many variations on methods for obtaining quantitative
measures of the predicted fitness of a substitution in an antibody in such a way that
these values may subsequently be combined. All such variations are included as
aspects of the invention.

By combining the scores obtained from the rules used in methods 132 of
expert system 100, a set of substitutions can be identified for testing. These may be
the substitutions with the highest aggregate scores, they may be the substitutions with
the highest score for each individual rule 120, or they may be derived in some other
way using the scores produced by the rules 120 used by methods 130 of expert system
100. In some embodiments, the number of substitutions selected by step 03 of Fig. 2
in one cycle of the optimization process is less than 1000 substitutions, more
preferably less than 250 substitutions, more preferably less than 100 substitutions and
more preferably less than 50 substitutions.
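
The selection options described in this section (highest aggregate score, highest score per rule, threshold selection, and elimination of the lowest-ranked substitutions) can be sketched as follows. The rule names, the score values and the use of a plain sum to aggregate scores are all hypothetical choices made for illustration; the specification leaves the combination method open.

```python
def aggregate_scores(rule_scores):
    """Combine per-rule scores into one total per substitution.

    A plain sum is an assumption of this sketch, not a prescribed method.
    """
    totals = {}
    for scores in rule_scores.values():
        for substitution, score in scores.items():
            totals[substitution] = totals.get(substitution, 0.0) + score
    return totals


def select_top(scores, n):
    """The n substitutions with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def select_above(scores, threshold):
    """All substitutions scoring above a threshold, best first."""
    return [s for s in select_top(scores, len(scores)) if scores[s] > threshold]


def eliminate_lowest(scores, n):
    """Drop the n lowest-scoring substitutions and return the rest, best first."""
    doomed = set(sorted(scores, key=scores.get)[:n])
    return [s for s in select_top(scores, len(scores)) if s not in doomed]


def best_per_rule(rule_scores):
    """The single highest-scoring substitution for each individual rule."""
    return {rule: max(scores, key=scores.get) for rule, scores in rule_scores.items()}


# Hypothetical scores for four substitutions under two rules.
rule_scores = {
    "pca_load": {"L180I": 0.42, "P97S": 0.10, "E138A": -0.31, "Y194S": 0.27},
    "substitution_matrix": {"L180I": 0.20, "P97S": 0.35, "E138A": 0.05, "Y194S": -0.10},
}
totals = aggregate_scores(rule_scores)

print(select_top(totals, 2))
print(select_above(totals, 0.2))
print(eliminate_lowest(totals, 1))
print(best_per_rule(rule_scores))
```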

5.2 DESIGN OF AN ANTIBODY VARIANT SET 

The rules discussed in Section 5.1 above and shown in Fig. 3 are one example
of the way in which an initial sequence space can be defined. The sequence space is
defined in terms of an initial target antibody sequence, and substitutions to be made in
that target sequence. Each substitution is defined in terms of a position in the target
antibody, and the identity of a monomer with which the monomer at that position in
the target antibody is to be replaced. Selection of the target antibody corresponds to
step 01 in Fig. 2. Definition of the sequence space corresponds to step 02 in Fig. 2.
This section is directed to step 03 of Fig. 2.

Once an initial set of substitutions has been selected in accordance with
Section 5.1, a set of variants incorporating these changes can be designed (the
designed antibody variant set). This process corresponds to step 03 in Fig. 2. In
preferred embodiments, this designed antibody variant set includes only a subset of
the total number of possible variants that could be generated. For example, the total
number of possible variant proteins in a sequence space defined by a target antibody
containing all possible combinations of 24 substitutions is 2^24 > 16,000,000. However,
the methods of the present invention allow the interrogation of this sequence space by
designing and synthesizing only a very small fraction of the total number of
antibodies that are included in the sequence space defined by the initial target
antibody and the substitutions. In some embodiments, the number of variants in the
designed antibody variant set is less than 1000 variants, more preferably less than 250
variants and more preferably less than 100 variants. This is possible because,
although the designed antibody variant set includes only a subset of the total number
of possible variants (e.g., the possible combinations of substitutions), care is taken to
test all antibody substitutions in many different sequence contexts. An example is
shown in Fig. 14, where a set of 24 variants was designed to interrogate the sequence
space defined by a target antibody sequence and the 24 substitutions in Fig. 13. Here,
each variant contains six substitutions, each substitution occurs six times within the
designed antibody variant set, and each occurrence of each substitution takes place
within a quite different context, that is, it is combined with a different set of other
substitutions each time.
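
The arithmetic behind this example can be checked directly. The short Python sketch below uses only the numbers stated above (24 substitutions, 24 variants, six substitutions per variant); the variable names are illustrative.

```python
from math import comb

substitutions = 24
variants = 24
per_variant = 6

# Every substitution is either present or absent, so the full space has 2^24 members.
full_space = 2 ** substitutions

# 24 variants x 6 substitutions = 144 slots, shared evenly among 24 substitutions.
occurrences_per_substitution = variants * per_variant // substitutions

# Fraction of the full sequence space that is actually synthesized.
fraction_tested = variants / full_space

# Each variant of six substitutions contains C(6, 2) distinct substitution pairs.
pairs_per_variant = comb(per_variant, 2)

print(full_space)
print(occurrences_per_substitution)
print(f"{fraction_tested:.2e}")
print(pairs_per_variant)
```

The design therefore samples roughly one in 700,000 of the possible variants while still presenting each substitution six times.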

The aim when designing a set of antibody variants to interrogate a sequence
space defined by a target antibody sequence and a set of substitutions is to obtain a
designed antibody variant set where the substitutions are distributed in such a way
that a large amount of information can subsequently be extracted from
sequence-activity relationships. In this respect the design of antibody variant sets has common
elements with the design of experimental datasets from a diverse range of other
disciplines including agriculture and engineering. Methods to optimize experimental
datasets (experimental design or design of experiment: DOE) were described by Sir R.
A. Fisher in 1920 (Fisher, The Design of Experiments, MacMillan Publishing
Company; 9th edition, 1971). Plackett and Burman developed the idea further with
the introduction of screening designs (e.g., Plackett et al., 1946, Biometrika 33: 305-325),
and Taguchi subsequently introduced the orthogonal matrix (Taguchi, 1986,
Introduction to Quality Engineering, Asian Productivity Organization, Distributed by
American Supplier Institute Inc., Dearborn, MI). Any number of experimental
design techniques can be used to maximize the information content of the designed
antibody variant set including, but not limited to, complete factorial design, 2^k
factorial design, 2^k fractional factorial design, central composite, latin squares,
greco-latin squares, Plackett-Burman designs, Taguchi design, and combinations thereof.
See, for example, Box et al., 1978, Statistics for Experimenters, New York, Wiley, for
examples of such techniques that can be used to construct a designed antibody variant
set from the initial set of antibody substitutions selected in accordance with Section
5.1 that tests a maximum number of combinations in a minimal number of antibody
variants.

The methods described above were designed to maximize the amount of
information that could be obtained from a specified limited number of experiments
that could be performed. This is conceptually comparable to the resource limitation
seen in antibody optimization, where functional tests are complex and limited by time,
cost or other resources. However, a significant difference between antibody
optimization and other applications of experimental design is that for antibody
optimization there is an additional constraint. In designing antibody variants, the
simultaneous introduction of many changes can adversely affect functional properties
of the antibody. In contrast to traditional experimental design strategies, it is
advantageous in the present invention to reduce the number of previously untested
substitutions present in each variant to ten or less, preferably to five or less, more
preferably to between 3 and 10. For instance, in some specific embodiments, the
number of previously untested substitutions present in each variant is 10, 9, 8, 7, 6, 5,
4 or 3. In other words, in subsequent cycles of steps 02 through 07, fewer than 10, 9, 8,
7, 6, 5, 4 or 3 new variants are chosen. Here, a variant refers to an antibody that
has a sequence that is identical to the sequence of the antibody selected in step 01 of
Fig. 2 with the exception that there are one or more substitutions in the sequence.
Here, a substitution refers to a mutation at a particular position in the antibody from
the residue found at that position in the antibody selected in step 01 of Fig. 2 to some
other residue.

To design an antibody variant set that will yield useful sequence-activity
information upon analysis of the functional properties and sequences of the antibody
variants, any method can be appropriate provided that the number of substitutions in
each variant set is relatively small so that the majority of antibodies are active. For
instance, in preferred embodiments, the number of previously untested substitutions
present in each variant is preferably 9, 8, 7, 6, 5, 4, 3 or 2. Furthermore, it is desirable
that each selected substitution be tried an approximately equal number of times in the
designed antibody variant set. It is further desirable that each substitution be tested in
many different sequence contexts. In other words, each substitution appears in a
number of different antibody variants, in each case being combined with a different
set of other substitutions. In Fig. 14 the substitution L180I appears in variant 3 with
P97S, E138A, Y194S, A236V, V267I and in variant 18 with N95C, S107D, V167I,
G293A, I310K.

A variation of the above method is to require (i) that each substitution
identified be tried an approximately equal number of times in the designed antibody
variant set, and (ii) that as many different combinations of two substitutions (e.g.,
substitution pairs) as possible be tested. For example, to test forty substitutions in an
antibody it may be desirable to incorporate a maximum of five changes per variant.
For forty substitutions there are (40 x 39)/2 = 780 possible pairs of substitutions. In one
variant with five substitutions there are ten pairs of substitutions. So in forty variants
there will be a maximum of 400 substitution pairs. The aim is then to maximize the
number of different substitution pairs that are tested and to try to represent each
substitution five times. The substitution pairs can be scored with the initial selection
algorithm, and the top scoring 400 substitution pairs tested. The problem of finding
variants that satisfy the constraints mentioned here is known as a
coverage problem. The coverage problem is NP-hard. Therefore greedy and other
forms of approximate solutions are used to solve the NP-hard problems in the present
invention. For instance, in some embodiments, the algorithms described in Gandhi et
al., 2001, Lecture Notes in Computer Science 2076: 225 are used. In some
embodiments, the desired set of sequences can be evolved using Monte Carlo
algorithms and genetic algorithms to maximize the number of pairs in the variant set.
Genetic algorithms are described in Section 7.5.1 of Duda et al., 2001, Pattern
Classification, Second Edition, John Wiley & Sons, Inc., New York, which is hereby
incorporated by reference in its entirety. Further, similar algorithms can be used to
expand the coverage problem to maximize the number of triplets, quadruplets and so
on.

An exemplary code for maximizing the substitution pairs using an
evolutionary coverage algorithm is shown below:

Let m be the number of identified substitutions, n be the number of variants to
be synthesized, and k be the number of substitutions per variant.

Create n initial variants with k substitutions, each substitution occurring n×k/m times
among variants. This can be done randomly or sequentially. This set is not optimal.

Then,

for 10000 iterations
{
    i.    Choose two random variants;
    ii.   Choose two random positions;
    iii.  Count the number of distinct substitution pairs seen among
          variants;
    iv.   Swap the substitutions (if any) at the two positions between
          the two chosen variants;
    v.    Check if the number of substitutions per variant is k;
    vi.   Check if the number of times a given substitution occurs among
          all variants equals n×k/m;
    vii.  Count the number of distinct substitution pairs seen among
          variants;
    viii. If the count from vii) is greater than the count from iii) and v)
          and vi) are true, accept the changes to the variants from step
          iv), else, dismiss the changes and retain original values.
}
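
A runnable Python sketch of the pseudocode above follows. It is one possible reading rather than a definitive implementation: each variant is modeled as a set of substitution indices, a "position" is taken to be the index of one candidate substitution, the starting set is built sequentially, and the function names and random seed are assumptions made for the example. The pair count of step iii is carried forward as `current` instead of being recomputed on every iteration.

```python
import itertools
import random


def count_distinct_pairs(variants):
    """Number of distinct substitution pairs seen among all variants."""
    pairs = set()
    for variant in variants:
        pairs.update(itertools.combinations(sorted(variant), 2))
    return len(pairs)


def occurrences(variants, m):
    """How many variants contain each of the m substitutions."""
    counts = [0] * m
    for variant in variants:
        for substitution in variant:
            counts[substitution] += 1
    return counts


def initial_variants(m, n, k):
    """Sequential starting set: each substitution occurs n*k/m times (not optimal)."""
    if (n * k) % m != 0 or k > m:
        raise ValueError("n*k must be a multiple of m, and k must not exceed m")
    stream = itertools.cycle(range(m))
    return [set(itertools.islice(stream, k)) for _ in range(n)]


def evolve(variants, m, k, iterations=10000, seed=0):
    rng = random.Random(seed)
    variants = [set(v) for v in variants]
    target = len(variants) * k // m
    current = count_distinct_pairs(variants)            # step iii
    for _ in range(iterations):
        a, b = rng.sample(range(len(variants)), 2)      # step i
        positions = rng.sample(range(m), 2)             # step ii
        old_a, old_b = variants[a], variants[b]
        new_a, new_b = set(old_a), set(old_b)
        for pos in positions:                           # step iv
            if pos in old_a and pos not in old_b:
                new_a.discard(pos)
                new_b.add(pos)
            elif pos in old_b and pos not in old_a:
                new_b.discard(pos)
                new_a.add(pos)
        if len(new_a) != k or len(new_b) != k:          # step v
            continue
        variants[a], variants[b] = new_a, new_b
        balanced = all(c == target for c in occurrences(variants, m))  # step vi
        proposed = count_distinct_pairs(variants)       # step vii
        if balanced and proposed > current:             # step viii
            current = proposed
        else:
            variants[a], variants[b] = old_a, old_b
    return variants


m, n, k = 40, 40, 5
start = initial_variants(m, n, k)
result = evolve(start, m, k)
print(count_distinct_pairs(start), "pairs before,", count_distinct_pairs(result), "pairs after")
```

With m = n = 40 and k = 5, the sequential starting set repeats the same eight variants five times and so covers only 80 distinct pairs; the accepted swaps can only raise this count, toward the ceiling of 400 noted in the text.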

Alternatively, the set of substitutions can be divided into two or more groups
and be used to design variants where each variant contains substitutions from a
particular group, for example by dividing the antibody into functional domains such
as different complementarity determining regions (CDRs) or framework regions. The
substitutions in such a variant can be subjected to the coverage algorithms with the
constraints described above. Each group can also be combined with other groups of
substitutions to design initial variants, and the coverage algorithm can be applied to
the combination of substitution groups. Groups of substitutions can be arrived at using
knowledge of antibody domains and/or functional and structural properties of amino acid
residues in the antibody. For example, we can identify all substitutions based on Section 5.1
and select the top scoring ones, and classify them into groups of substitutions based on which
domain of the antibody they are present in, such as different complementarity
determining regions (CDRs) or framework regions. Alternatively, we can also classify
substitutions based on their spatial location in the protein structure (e.g., surface
positions versus interior positions) based on experimentally determined structure or
using prediction algorithms. Alternatively, substitutions can be classified based on
their proximity to the binding site (e.g., residues < 5 Å from the binding site belong to
one class and residues > 5 Å from the binding site to another). Constraints on the number
of substitutions to be designed in a variant from each substitution group can also be
added (e.g., no more than two variants from each substitution group). For example,
two substitutions can be chosen from the group close to the binding site and three
from the group on the surface of the antibody. Such methods differ from typical
experimental design or design of experiment (DOE) methods in the fact that no more
than five changes in a variant are allowed and the occurrence of the selected pairs is
maximized by scoring. Other DOE methods for distributing 40 substitutions would
require as many as between 18 and 22 changes in an antibody, which would have a
high likelihood of being detrimental to antibody function.
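
As an illustration of grouping with per-group quotas, the sketch below classifies substitutions by distance from the binding site using the 5 Å cutoff mentioned above, then draws a variant with two substitutions from the near group and three from the other. The distances are invented for the example, the substitution names are borrowed from the Fig. 14 discussion purely as labels, and treating a distance of exactly 5 Å as "distal" is an assumption.

```python
import random


def classify_by_distance(distance_to_site, cutoff=5.0):
    """Split substitutions into two groups by distance (in angstroms) to the binding site."""
    groups = {"near_binding_site": [], "distal": []}
    for substitution, distance in sorted(distance_to_site.items()):
        key = "near_binding_site" if distance < cutoff else "distal"
        groups[key].append(substitution)
    return groups


def draw_variant(groups, quota, rng):
    """Pick the requested number of substitutions from each group."""
    chosen = []
    for name, count in quota.items():
        chosen.extend(rng.sample(groups[name], count))
    return sorted(chosen)


# Hypothetical distances; real values would come from a structure or a prediction.
distances = {
    "N95C": 3.1, "P97S": 4.2, "S107D": 4.8, "E138A": 9.5,
    "V167I": 12.0, "L180I": 7.7, "Y194S": 15.3, "A236V": 6.4,
}
groups = classify_by_distance(distances)
variant = draw_variant(groups, {"near_binding_site": 2, "distal": 3}, random.Random(0))

print(groups)
print(variant)
```

Variants drawn this way can then serve as the starting set for the coverage algorithm, with the quota checked alongside the other constraints.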

Alternatively or additionally, an antibody variant set can be created
stochastically by library synthesis methods such as parallel site-directed mutagenesis,
DNA shuffling or other methods for incorporating defined substitutions into an
antibody such as those described in Section 5.8. In these instances the variant set
contains substitutions distributed at random, so precisely defined variants are not
synthesized. Instead, the introduction of substitutions is controlled so that the average
number of substitutions incorporated into each variant is between 1 and 10, more
preferably the average number of substitutions incorporated into each variant is
between 1 and 5. Variants can then be selected at random and the distribution of
substitutions can be determined by determining the sequence of the antibody. In some
embodiments of the invention less than 1000 variants created by library synthesis
methods are synthesized and sequenced, preferably less than 500 variants created by
library synthesis methods are synthesized and sequenced, more preferably less than
250 variants created by library synthesis methods are synthesized and sequenced,
even more preferably less than 100 variants created by library synthesis methods are
synthesized and sequenced. In some embodiments that use libraries, the creation of a
library can be simulated using computational modeling of shuffling and other
methods. See, for example, Moore, 2001, Proc Natl Acad Sci USA 13, 3226-3231;
Moore and Maranas, 2000, J Theor Biol. 205, pp. 483-503.
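
A very simple stand-in for such a simulation is sketched below: every candidate substitution is incorporated independently with a fixed probability chosen to give the desired average number of substitutions per variant. This independence model, the placeholder substitution names and the library size are assumptions made for illustration; it is not the shuffling model of the cited references.

```python
import random


def simulate_library(substitutions, n_variants, mean_per_variant, seed=0):
    """Random library in which each substitution is incorporated independently."""
    probability = mean_per_variant / len(substitutions)
    if not 0.0 <= probability <= 1.0:
        raise ValueError("mean_per_variant must lie between 0 and the number of substitutions")
    rng = random.Random(seed)
    return [
        [s for s in substitutions if rng.random() < probability]
        for _ in range(n_variants)
    ]


substitutions = [f"S{i}" for i in range(1, 25)]      # 24 placeholder substitutions
library = simulate_library(substitutions, 2000, mean_per_variant=3.0)
mean_found = sum(len(variant) for variant in library) / len(library)

# Pick 100 clones at random, as one would before sequencing them.
sequenced = random.Random(1).sample(library, 100)

print(round(mean_found, 2))
print(sequenced[0])
```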

Once the antibody variant set has been designed, the variants are synthesized
using methods known in the art. Representative, but nonlimiting, synthetic methods
are described in Section 5.8, below. Then the antibodies are tested for relevant
biological properties. Such relevant biological properties include, but are not limited
to, antibody solubility and activity. Nonlimiting examples of how such antibody
activity can be tested are described in Section 5.9 below. Together the synthesis and
testing of the antibody variants represent step 04 in Fig. 2.

5.3 METHODS FOR MAPPING A SEQUENCE SPACE TO A FUNCTION SPACE

Once substitutions have been selected using expert system 100 (Fig. 2, step
04), and variants have been designed, synthesized and tested for one or more activities
or functions, it is desirable to use the sequence and activity information from the
designed antibody variant set to assess the contributions of substitutions to the one or
more antibody activities or functions. This process is represented as step 05 in Fig. 2.
Assessment of the contributions of substitutions to one or more antibody functions can
be performed by deriving a sequence-activity relationship. Such a relationship can be
expressed very generally, for example as shown in Equation 3

(Eq. 3)    $Y = f(x_1, x_2, \ldots, x_i)$

where,

Y is a quantitative measure of a property of the antibody (e.g., activity),

$x_i$ is a descriptor of a substitution, a combination of substitutions, or a
component of one or more substitutions in the sequence of the antibody, and

f() is a mathematical function that can take several forms.

A model of the sequence-activity relationship can be described as a functional
form whose parameters have been trained for the input data (Y and $x_i$). Protein
sequences can be mathematically represented in terms of many variables (descriptors,
predictors), each variable representing the type of amino acid at a specific location
(linear form in terms of the position of the amino acid). For example, the sequence
AGWRY can be represented by five variables, where variable one assumes a value of
"A" corresponding to position one, variable two is "G" corresponding to position two,
and so on. Each variable can assume 1 of 20 possibilities. Alternatively, each variable
can also represent multiple positions (say two) and assume 1 of 400 values (for 2
positions) corresponding to the 20 x 20 = 400 combinations of possible amino acid pairs.
For example, a variable can describe positions one and two and assume a value of
"AG" (thereby creating a variable that is non-linear in terms of the position of the amino
acid). Alternatively, each position can assume a value corresponding to a
physico-chemical property of the amino acid instead of amino acid identity. For example, the
position can be described in terms of the mass of the amino acid at that location. For
the sequence AGWRY, a variable for position one can assume the value 71.09 and
position two 57.052, and so on. Alternatively, each position can be described by one or
several principal components derived to represent many physico-chemical properties
of the amino acid present in that position. Alternatively, each variable can be a
combination of variables representing properties of amino acids. Alternatively, each
variable can be represented in a binary form corresponding to presence or absence of
a particular amino acid. For example, consider two variants, AGWRY and AKWRY.
Position two can be "1" if G is present at that position and "0" if it is absent, and the
descriptor for that position can have the value "0" or "1." Alternatively, each variable
can be represented in a binary form corresponding to presence or absence of a defined
group of amino acids.
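
The descriptor examples in this paragraph can be reproduced with a short Python sketch. The masses for A and G are the values quoted above; the masses for W, R and Y are approximate average residue masses supplied for this illustration, and the function names are hypothetical.

```python
# Average residue masses in daltons. A and G use the values quoted in the text;
# W, R and Y are approximate values added only to complete the example.
RESIDUE_MASS = {"A": 71.09, "G": 57.052, "W": 186.21, "R": 156.19, "Y": 163.18}


def mass_descriptors(sequence):
    """One physico-chemical descriptor (residue mass) per position."""
    return [RESIDUE_MASS[residue] for residue in sequence]


def pair_descriptor(sequence, first, second):
    """A single variable covering two 1-based positions, e.g. 'AG'."""
    return sequence[first - 1] + sequence[second - 1]


def presence_descriptor(sequence, position, residue):
    """1 if `residue` occupies the 1-based `position`, otherwise 0."""
    return 1 if sequence[position - 1] == residue else 0


masses = mass_descriptors("AGWRY")
pair = pair_descriptor("AGWRY", 1, 2)
g_in_first = presence_descriptor("AGWRY", 2, "G")
g_in_second = presence_descriptor("AKWRY", 2, "G")

print(masses)
print(pair)
print(g_in_first, g_in_second)
```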

In Equation 3, the functional form f() correlates descriptors of an antibody
sequence ($x_i$) to its activity. In a simple embodiment of the invention, the function f
can be a linear combination of $x_i$:

(Eq. 5)    $Y = w_1 x_1 + w_2 x_2 + \cdots + w_i x_i$

where $w_i$ is a weight (or coefficient) of $x_i$.

In some embodiments, to derive a sequence-activity relationship, a set of
descriptors ($x_i$) that can describe all of the substitutions within the antibody variant set
is identified. Values of Y for each member of the antibody variant set are measured.
Values for each weight ($w_i$) are then calculated such that the differences between
values predicted for each value of Y by Equation 3 and those observed experimentally
are minimized for the antibody variant set, or for a selected subset of such antibody
variants.

The minimization step above can also use weights for different activity
predictions and, in general, can use a loss function. In one embodiment this loss
function can be squared error loss, where weights that minimize the sum of squares of
the differences between predicted and measured values for the dataset are computed.
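
As a non-limiting sketch of the squared error loss, the weights of a linear model can be obtained from the normal equations. The solver, the function names, and the four-variant data set below are hypothetical and serve only to make the calculation concrete; an actual implementation would more likely rely on an established numerical library.

```python
def solve_linear_system(a, b):
    """Solve a*w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * w[c] for c in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_least_squares(x_rows, y):
    """Weights minimizing sum((y - x.w)^2), via the normal equations."""
    n = len(x_rows[0])
    xtx = [[sum(row[i] * row[j] for row in x_rows) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * target for row, target in zip(x_rows, y))
           for i in range(n)]
    return solve_linear_system(xtx, xty)


# Hypothetical variants: columns are [constant, substitution 1, substitution 2].
x_rows = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [1.0, 3.0, 0.5, 2.5]
weights = fit_least_squares(x_rows, y)
print(weights)
```

In this constructed example the activities are exactly additive, so the fitted weights reproduce the contributions used to build the data.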

In some embodiments statistical regression methods are used to identify
relationships between independent (x_i) and dependent (Y) variables. Such techniques
include, but are not limited to, linear regression, non-linear regression, logistic
regression, multivariate data analysis, and partial least squares regression. See, for
example, Hastie, The Elements of Statistical Learning, 2001, Springer, New York;
Smith, Statistical Reasoning, 1985, Allyn and Bacon, Boston. In one embodiment,
regression techniques like PLS (Partial Least Squares) can be used to solve for the
weights (w_i) in the equation X. PLS is a tool for modeling
linear relationships between descriptors. The method is used to compress the data
matrix composed of descriptors (variables) of the variant sequences being modeled into a
set of latent variables called factors. The number of latent variables is much smaller
than the number of variables (descriptors) in the input sequence data. For example, if
the number of input variables is 100, the number of latent variables can be less than 10.
The factors are determined using the nonlinear iterative partial least squares
algorithm. The orthogonal factor scores are used to fit a set of activities to the
dependent variables. Even when the predictors are highly collinear or linearly
dependent, the method finds a good model. Alternative PLS algorithms like
SIMPLS can also be used for regression. In such methods, the contribution to the
activities from every variable can be deconvoluted to study the effect of sequence on
the function of the antibody.
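
The compression into latent factors can be illustrated with a deliberately small sketch that extracts only the first factor of a single-response PLS model. This is an assumption-laden simplification, not the full iterative algorithm: the function names and the data are hypothetical, and the two descriptor columns are made exactly collinear to show that a fit is still obtained where ordinary least squares would face a singular system.

```python
import math


def pls1_one_factor(x_rows, y):
    """Fit a one-latent-factor PLS model; returns (coefficients, x_means, y_mean)."""
    n_vars = len(x_rows[0])
    x_means = [sum(row[j] for row in x_rows) / len(x_rows) for j in range(n_vars)]
    y_mean = sum(y) / len(y)
    xc = [[row[j] - x_means[j] for j in range(n_vars)] for row in x_rows]
    yc = [value - y_mean for value in y]

    # Weight vector: descriptor-space direction with maximal covariance with Y.
    w = [sum(xc[i][j] * yc[i] for i in range(len(xc))) for j in range(n_vars)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Factor scores t = Xw, then regress Y on the scores.
    t = [sum(row[j] * w[j] for j in range(n_vars)) for row in xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    # With a single factor the regression coefficients reduce to q * w.
    return [q * v for v in w], x_means, y_mean


def predict(model, row):
    coefficients, x_means, y_mean = model
    return y_mean + sum(c * (v - m) for c, v, m in zip(coefficients, row, x_means))


# Hypothetical descriptors: the second column is exactly twice the first.
x_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [3.0, 6.0, 9.0, 12.0]
model = pls1_one_factor(x_rows, y)
print(model[0], predict(model, [5.0, 10.0]))
```

The per-descriptor coefficients returned by the sketch are the quantities that would be inspected when deconvoluting the contribution of each variable.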

In some embodiments, modeling techniques are used to derive sequence-activity
relationships. Such modeling techniques include linear and non-linear
approaches. Linear and non-linear approaches are differentiated from each other
based on the algebraic relationships used between variables and responses in such
approaches. In the system being modeled, the input data (e.g., variables that serve as
descriptors of the antibody sequence), in turn, can be linearly related to the variables
provided or can be non-linear combinations of the variables. It is therefore possible to
perform different combinations of models and data-types: linear input variables can
be incorporated into a linear model, non-linear input variables can be incorporated
into a linear model, and non-linear variables can be incorporated into a non-linear
model.

Many functional forms of f() (Eq. 3) can be used and the functional form can
be combined using weights defined in the knowledge base 108 for analysis. For
example, function f() can assume a non-linear form. An example of a non-linear
functional form is:

Y = w_12 x_1 x_2 + w_13 x_1 x_3 + ... + w_nn x_n x_n

Non-linear functions can also be derived using modeling techniques such as
machine learning methods. For example, the sequence (x_i)-activity (Y) data can be
used to predict the activities of any sequence, given the descriptors for that sequence,
using neural networks, Bayesian models, generalized additive models, support vector
machines, or classification using regression trees.

The data describing variants of the initial antibody can be represented in many
forms. In some embodiments, all or a portion of the data is represented in a binary
format. For example, representing the presence or absence of a specified residue at a
particular position by a "1" or a "0" constitutes a linear binary variable. In another
example, representing the presence of a specified residue at one position AND a
second specified residue at a second position by a "1" constitutes a non-linear binary
variable. In some embodiments, all or a portion of the data is represented as Boolean
operators. In some embodiments, all or a portion of the data is represented as
principal component descriptors derived from a set of properties. See, for example,
Sandberg et al., 1998, J Med Chem. 41, 2481-91. Antibody input sequence data can
also use descriptors based on comparison with a sequence profile (e.g., a hidden
Markov model, or principal component analysis of a set of sequences). For example,
in Fig. 9, PC1 and PC2 values of the sequences can be used as descriptors for the
sequences in that figure. In addition, any number of principal components can be used
as descriptors. See, for example, Casari et al., 1995, Nat Struct Biol. 2:171-8; and
Gogos et al., 2000, Proteins 40:98-105.

To initiate step 05 (Fig. 2), the antibody sequence data in the designed set and
the results of the assays performed on the designed set are converted to a form that
can be used in pattern classification and/or statistical techniques in order to identify
relationships between the results of the assays and the substitutions present in the
designed set. In general, such conversion involves a step in which independent
variables and dependent variables are enumerated. Here, the independent variables
are the various substitutions (mutations) that are present in the designed set. The
dependent variables are the results of assays, such as those described in Section 5.9.

Each substitution can be considered independently. The presence or absence
of a substitution or residue at a specific position can be used to describe one or more
of the independent variables. The presence or absence of two or more substitutions or
residues at two or more specific positions can be used to describe one or more of the
independent variables. One or more physico-chemical descriptors of a substitution or
residue at a specific position can be used to describe one or more of the independent
variables. One or more physico-chemical descriptors of two or more substitutions or
residues at two or more specific positions can be used to describe one or more of the
independent variables. Then, pattern classification and/or statistical techniques are
used to identify relationships between particular substitutions, or combinations of
substitutions, and the assay data.

In some embodiments, supervised learning techniques are used to identify
relationships between mutations in the designed set and antibody properties identified
in assay results such as assays performed in Section 5.9. Such supervised learning
techniques include, but are not limited to, Bayesian modeling, nonparametric
techniques (e.g., Parzen windows, k_n-Nearest-Neighbor algorithms, and fuzzy
classification), neural networks (e.g., Hopfield network, multilayer neural networks
and support vector machines), and machine learning algorithms (e.g.,
algorithm-independent machine learning). See, for example, Duda et al., Pattern Classification,
2nd edition, 2001, John Wiley & Sons, Inc., New York; and Pearl, Probabilistic
Reasoning in Intelligent Systems: Networks of Plausible Inference, Revised Second
Printing, 1988, Morgan Kaufmann, San Francisco. For example, the sequence (x_i)-
activity (Y) data can be used to predict the activities of any sequence given the
descriptors for a sequence using a neural network. The input for the network is the
descriptors and the output is the predicted value of Y. The weights and the activation
function can be trained using supervised decision based learning rules. The learning
is performed on a subset of variants called the training set and performance of the
network is evaluated on a test set.

In some embodiments, unsupervised learning techniques are used to identify
relationships between mutations in the designed set and antibody properties identified
in assay results such as assays performed in Section 5.9. Such unsupervised learning
techniques include, but are not limited to, stochastic searches (e.g., simulated
annealing, Boltzmann learning, evolutionary methods, principal component analysis,
and clustering methods). See, for example, Duda et al., Pattern Classification, 2nd
edition, 2001, John Wiley & Sons, Inc., New York. For example, the weights in
equation 5 can be adjusted by using Monte Carlo and genetic algorithms. The
optimization of weights for non-linear functions can be complicated and no simple
analytical method can provide a good solution in closed form. Genetic algorithms
have been successfully used in search spaces of such magnitude. Genetic algorithms
and genetic programming techniques can also be used to optimize the functional form
to best fit the data. For instance, many recombinations of functional forms applied on
descriptors of the sequence variants can be applied.

In some embodiments, boosting techniques are used to construct and/or
improve models developed using any of the other techniques described herein. A
model of the sequence-activity relationship can be described as a functional form
whose parameters have been trained for the input data (Y and x_i). Many algorithms and
techniques to build models have been described. Algorithms applied on a specific
dataset can be weak in that the predictions can be less accurate or "weak" (yielding
poor models). Models can be improved using boosting techniques. See, for example,
Hastie et al., The Elements of Statistical Learning, 2001, Springer, New York. The
purpose of boosting is to combine the outputs of many "weak" predictors into a
powerful "committee." In one embodiment of the invention, boosting is applied using
the AdaBoost algorithm. Here, the prediction algorithm is sequentially applied to
repeatedly modified versions of the data, thereby producing a sequence of models.
The predictions from all of these models are combined through a weighted majority
vote to produce the final prediction. The data modification at each step consists of
applying a weight (w_i) to each training observation i. Initially the weights are set
to 1/N, where N is the number of training observations (sequence-activity data). The
weights are modified individually in each successive iteration. Training observations
that were predicted poorly by a particular model have their weights increased and
training observations that were predicted more accurately have their weights
decreased. This forces each successive model to concentrate on those training
observations that are missed by the previous model. The step of combining the models
to produce a "committee" assigns a weight to each model based on the overall
prediction error of that model.
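
The reweighting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the weak learner is a hypothetical single-threshold rule on one numeric descriptor, activities are reduced to two classes (+1 for active, -1 for inactive), and the model weight 0.5*ln((1 - error)/error) is the usual choice for discrete AdaBoost rather than anything prescribed here.

```python
import math


def stump_predict(stump, x):
    threshold, sign = stump
    return sign if x < threshold else -sign


def best_stump(xs, ys, weights):
    """Weak learner: the threshold rule with the lowest weighted error."""
    best, best_error = None, float("inf")
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    for threshold in thresholds:
        for sign in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict((threshold, sign), x) != y)
            if error < best_error:
                best, best_error = (threshold, sign), error
    return best, best_error


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # initial observation weights, 1/N each
    committee = []                   # list of (model weight, stump)
    for _ in range(rounds):
        stump, error = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        committee.append((alpha, stump))
        # Raise the weights of poorly predicted observations, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee


def committee_predict(committee, x):
    vote = sum(alpha * stump_predict(stump, x) for alpha, stump in committee)
    return 1 if vote >= 0 else -1


# Hypothetical data: activity is lost only for intermediate descriptor values.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
committee = adaboost(xs, ys, rounds=3)
print([committee_predict(committee, x) for x in xs])
```

No single threshold rule can separate the two classes in this example, whereas the weighted vote of three such rules classifies every training observation correctly.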

The various modeling techniques and algorithms described herein can be
adapted to derive relationships between one or more desired properties or functions of
an antibody and therefore to make multiple predictions from the same model.
Modeling techniques that have been adapted to derive sequence-activity relationships
for antibodies are within the scope of the present invention. Some of these methods
derive linear relationships (for example partial least squares projection to latent
structures) and others derive non-linear relationships (for example neural networks).
Algorithms that are specialized for mining associations in the data are also useful for
designing sequences to be used in the next iteration of sequence space exploration.
These modeling techniques can robustly deal with experimental noise in the activity
measured for each variant. Often experiments are performed in replicates and for
each variant there will be multiple measurements of the same activity. These multiple
measurements (replicate values) can be averaged and treated as a single number for
every variant while modeling the sequence-activity relationship. The average can be a
simple mean or another form of an average such as a geometric or a harmonic mean.
In the case of multiple measurements, outliers can be eliminated. In addition, the
error estimation for a model derived using any algorithm can incorporate the multiple
measurements by calculating the standard deviation of the measurements,
comparing the predicted activity from the model with the average, and estimating the
confidence interval within which the prediction lies. Weights for observations to be
used in models can also be derived from the accuracy of measurement, for example,
through estimating standard deviation and confidence intervals. This procedure can
put less emphasis on variants whose measurements are not accurate. Alternatively,
these replicate values can be treated independently. This will result in duplicating the
sequences in the dataset. For example, if sequence variant i, represented by descriptor
values {x_j}^i1, has been measured in triplicate (Y_i1, Y_i2, Y_i3), the training set for
modeling will include descriptor values {x_j}^i2 with activity Y_i2 and {x_j}^i3 with activity
Y_i3 in addition to {x_j}^i1 with activity Y_i1, where {x_j}^i1 = {x_j}^i2 = {x_j}^i3.
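
The two treatments of replicate measurements can be sketched as follows. The helper names and the measurement values are hypothetical; the averaging functions are those of the Python standard library `statistics` module.

```python
from statistics import geometric_mean, harmonic_mean, mean


def average_replicates(measurements, method="mean"):
    """Collapse the replicates of each variant into a single activity value."""
    averagers = {"mean": mean,
                 "geometric": geometric_mean,
                 "harmonic": harmonic_mean}
    return {variant: averagers[method](values)
            for variant, values in measurements.items()}


def expand_replicates(descriptors, measurements):
    """Treat replicates independently: repeat the descriptor row per measurement."""
    rows = []
    for variant, values in measurements.items():
        for value in values:
            rows.append((descriptors[variant], value))
    return rows


# Hypothetical descriptors and triplicate activity measurements.
descriptors = {"AGWRY": [1, 0], "AKWRY": [0, 1]}
measurements = {"AGWRY": [2.0, 4.0, 8.0], "AKWRY": [1.0, 1.0, 1.0]}

print(average_replicates(measurements, "geometric"))
print(expand_replicates(descriptors, measurements))
```

Averaging yields one training row per variant, whereas the independent treatment yields one row per measurement with identical descriptor values.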

A representative modeling routine in accordance with one embodiment of the
invention comprises the following steps.



Step 302. Relevant descriptors of the monomeric variables are identified.
These descriptors can convey physico-chemical properties relevant to the interaction
between biomolecules or classify the monomers (residues) as discrete entities
represented in binary form as described earlier. The former is preferred for residue
positions in the antibody sequence where the number of different amino acid
substitutions is four or more, or where the variables can assume one of four possible
values for those positions and the physico-chemical property values are well
distributed (e.g., different from each other). The latter is preferred for positions that
have four or fewer possible values for the relevant variable, and/or where the values are
clustered (e.g., are not very different from each other). To create non-linear variables,
new variables are formed that are combinations of monomeric variables. For example,
consider two variants, AGWRY and AKYRY. The linear binary form of the variable
(descriptor) for position 2 assumes a value of "1" if G is present at that position and
"0" if it is absent. Alternatively, a non-linear variable can be created in addition to the
linear variables describing each position. In the above example, a new non-linear
variable representing positions "2" and "3" can assume four values in numeric form.
In one form, the variable can assume a value of 11 for "GW", 10 for "GY", 01 for
"KW" and 00 for "KY". In other representations of a binary non-linear variable, four
variables can describe positions 2 and 3, where variable one assumes a value of "1" if
the sequence at positions 2 and 3 is "GW" and "0" otherwise, the second variable
takes the value "1" if the sequence is "GY" and "0" otherwise, and so on.
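
The two non-linear representations in the example above can be sketched as follows. The function names, and the convention that the first listed residue at each position maps to the digit 1, are assumptions made for this illustration.

```python
from itertools import product


def pair_code(sequence, first_options, second_options, positions=(2, 3)):
    """Two-digit code; a digit is 1 when the first listed option is present."""
    a, b = (sequence[p - 1] for p in positions)
    return f"{int(a == first_options[0])}{int(b == second_options[0])}"


def pair_indicators(sequence, first_options, second_options, positions=(2, 3)):
    """One 0/1 variable per residue pair, e.g. GW, GY, KW and KY."""
    observed = "".join(sequence[p - 1] for p in positions)
    pairs = ["".join(pair) for pair in product(first_options, second_options)]
    return {pair: int(pair == observed) for pair in pairs}


for variant in ("AGWRY", "AKYRY"):
    print(variant,
          pair_code(variant, "GK", "WY"),
          pair_indicators(variant, "GK", "WY"))
```
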

In some embodiments it is advantageous to identify regions, and thereby
variables, based on factors including, but not limited to, structures, domains, motifs
and exons, optionally using expert system 100 to do so, in order to weigh different
variables and their contribution to the model or to build sequence-activity models
based on these factors. For example, a weight of "1" can be assigned to variables in
the heavy chain of the antibody and "0" to variables in the light chain of the antibody
when modeling activity Y_i, and a weight of "0" can be assigned to variables in the heavy
chain and "1" to variables in the light chain when modeling activity Y_j. This weighting can also
incorporate constraints such as immunogenicity and other functional considerations
that may or may not be measured in experiments, but which can be fully or partially
predicted using computational techniques. For example, a negative weight can be
assigned to appearance of a T-cell epitope in a variant, or removal of glycosylation



Step 304. In step 304 the parameters for the functional form of the
sequence-activity relationship are optimized to obtain a model by minimizing the
difference between the predicted values and real values of the activity of the antibody.
Such optimization adjusts the individual weights for each descriptor identified in
preceding steps using a refinement algorithm such as least squares regression
techniques. Other methods that use alternative loss functions for minimization can be
used to analyze any particular dataset. For example, in some antibody sequence-activity
data sets, the activities may not be distributed evenly throughout the measured
range. This will skew the model towards data points in the activity space that are
clustered. This can be disadvantageous because datasets often contain more data for
antibody variants with low levels of activity, so the model or map will be biased
towards accuracy for these antibodies that are of lower interest. This skewed
distribution can be compensated for by modeling using a probability factor or a cost
function based on expert knowledge. This function can be modeled for the activity
value or can be used to assign weights to data points based on their activity. As an
example, for a set of activities in the range of 0 to 10, transforming the data with a
sigmoidal function centered at five will give more weight to sequences with activity
above five. Such a function can optionally also be altered with subsequent iterations,
thereby focusing the modeling on the part of the dataset with the most desired
functional characteristics. This approach can also be coupled with exploring
techniques like a Tabu search, where undesired space is explored with lower
probabilities.
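
The sigmoidal weighting in the example above can be sketched as follows. The logistic form, the steepness parameter, and the function names are assumptions chosen for illustration; any monotone function centered at five would serve the same purpose.

```python
import math


def sigmoid_weight(activity, center=5.0, steepness=1.0):
    """Weight in (0, 1) that grows with activity and equals 0.5 at `center`."""
    return 1.0 / (1.0 + math.exp(-steepness * (activity - center)))


def weighted_squared_error(predicted, measured, center=5.0):
    """Squared error loss in which high-activity variants count for more."""
    return sum(sigmoid_weight(m, center) * (p - m) ** 2
               for p, m in zip(predicted, measured))


for activity in (0.0, 2.0, 5.0, 8.0, 10.0):
    print(activity, round(sigmoid_weight(activity), 4))
```

With this loss, the same prediction error costs more for a variant of activity 8 than for a variant of activity 2, which steers the fit towards the region of greater interest.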

In some embodiments, algorithms are used that optimize the sequence-activity model
for the dataset by starting with a random solution (e.g., randomly assigned weights
w_i) and using methods like hill-descent and/or Monte Carlo and/or genetic algorithm
approaches to identify optimal solutions.
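
A minimal hill-descent sketch, assuming a linear model and a squared error loss, is given below. The move set (one weight changed at a time by a fixed step that is halved when no move helps) is one simple choice among many and is not prescribed by this description; the data and function names are hypothetical.

```python
import random


def squared_error(weights, x_rows, y):
    return sum((sum(w * x for w, x in zip(weights, row)) - target) ** 2
               for row, target in zip(x_rows, y))


def hill_descent(x_rows, y, seed=0, step=1.0, tolerance=1e-6):
    rng = random.Random(seed)
    # Random starting solution.
    weights = [rng.uniform(-1.0, 1.0) for _ in x_rows[0]]
    loss = squared_error(weights, x_rows, y)
    while step > tolerance:
        improved = False
        for j in range(len(weights)):
            for delta in (step, -step):
                trial = weights[:]
                trial[j] += delta
                trial_loss = squared_error(trial, x_rows, y)
                if trial_loss < loss:        # accept only downhill moves
                    weights, loss, improved = trial, trial_loss, True
        if not improved:
            step /= 2.0                      # refine the search once stuck
    return weights, loss


# Hypothetical variants with two binary descriptors and additive activities.
x_rows = [[1, 0], [0, 1], [1, 1]]
y = [2.0, -1.0, 1.0]
weights, loss = hill_descent(x_rows, y)
print(weights, loss)
```

Running the search from several seeds and comparing the resulting weights is one way to look for features common to multiple solutions.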

In embodiments directed to antibody engineering, robustness of the models
used is a significant criterion. Thus, obtaining several sub-optimal solutions from
various initial conditions and looking at all the models for common features can be a
desirable methodology for ensuring the robustness of the solution. Another way to
obtain robust solutions is to create bootstrap data sets based on the input data, then
estimate a p-value or confidence on the various coefficients of the model. In addition,
boosting techniques like AdaBoost can be used to obtain a "committee" based
solution.

Step 306. Many mathematical modeling techniques for deriving a sequence-activity
correlation are evaluated. Preferred mathematical modeling techniques used
to identify and capture the sequence-activity correlation (i) handle very large numbers
of variables (e.g., 20 or more) and correlations between variables, (ii) handle linear and
non-linear interactions between variables, and (iii) are able to extract the variables
responsible for a given functional perturbation for subsequent testing of the
mathematical model (e.g., models should be easily de-convoluted to assign the effect
of variables describing the amino acid substitutions to activities).

Step 308. In step 308 the coefficients (parameters) of the model(s) are
deconvoluted to see which amino acid substitutions (variables/descriptors of the
variants) influence the activity of the antibody. It can be important to identify which
descriptors of the antibody are important for the activity of interest. Some of the
techniques, such as partial least squares regression (SIMPLS) that uses projection to
latent structures (compression of the data matrix into orthogonal factors), may be good at
directly addressing this point because contributions of variables to any particular
latent factor can be directly calculated. See, for example, Bucht et al., 1999, Biochim
Biophys Acta 1431:471-82; and Norinder et al., 1997, J Pept Res 49:155-62. Other
methods such as neural networks can learn from the data very well and make
predictions about the activity of entire antibodies, but it may be difficult to extract
information, such as individual contributing features of the antibody, from the model.
Modeling techniques/methods that directly correlate the amino acid variations to the
activity are preferred because the sequence-activity map (relationship) can be derived
to construct new variants not in the dataset that have preferentially higher activities.
These methods can be adapted to provide a direct answer and output in desired forms.

Step 310. In step 310 the models developed using various algorithms and
methods in the previous step can be evaluated by cross validation methods. For
example, randomly leaving data out when building a model and then making predictions
for the data not incorporated into the model is a standard technique for cross validation. In
some instances of antibody engineering, data may be generated over a period of
months. The data can be added incrementally to the modeling procedure as and when
such data becomes available. This can allow for validation of the model with partial
or additional datasets, as well as predictions for the properties of antibody sequences
for which activities are still not available. This information may then be used to
validate the model.
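
Cross validation can be sketched with a leave-one-out scheme, in which each variant in turn is withheld, the model is rebuilt, and the withheld activity is predicted. The sketch assumes, purely for brevity, a model with a single numeric descriptor fitted by ordinary least squares; the function names and data are hypothetical, and a systematic leave-one-out loop stands in for random omission.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


def leave_one_out_rmse(xs, ys):
    """Root mean squared error of predictions for withheld observations."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        squared_errors.append((intercept + slope * xs[i] - ys[i]) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
print(leave_one_out_rmse(xs, [1.0, 3.0, 5.0, 7.0, 9.0]))    # data on a line
print(leave_one_out_rmse(xs, [1.0, 3.0, 5.0, 7.0, 20.0]))   # one discordant value
```

A model that generalizes well gives a small error on the withheld variants, whereas a discordant measurement inflates it.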

An example of internal model validation methods is shown in Figs 4 and 5. In 
these schemes a confidence score for each regression coefficient or weight vector can 
be generated for any antibody sequence-activity model. 

For example, in one embodiment of the present invention, average values for
weight functions can be obtained by omitting a part of the available data. Either
individual sequences and their associated activities or individual substitution positions
can be left out. A sequence-activity relationship can then be constructed from this
partial data. This process can be repeated many times, each time selecting the data to
leave out at random. Finally an average and range of values for each weight function
is calculated. The weight functions can then also be ranked in order of their
importance to activity.
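
The omission procedure can be sketched as follows. Two simplifying assumptions are made and should be read as such: the weight of a substitution is estimated as the difference between the mean activity of variants with and without it, and each variant is left out exactly once instead of being chosen at random. The data and names are hypothetical.

```python
from statistics import mean


def substitution_weight(has_substitution, activities):
    """Mean activity with the substitution minus mean activity without it."""
    with_sub = [a for flag, a in zip(has_substitution, activities) if flag]
    without = [a for flag, a in zip(has_substitution, activities) if not flag]
    return mean(with_sub) - mean(without)


def leave_out_weights(has_substitution, activities):
    """Average and range of the weight over data sets with one variant omitted."""
    weights = []
    for i in range(len(activities)):
        flags = has_substitution[:i] + has_substitution[i + 1:]
        values = activities[:i] + activities[i + 1:]
        weights.append(substitution_weight(flags, values))
    return mean(weights), min(weights), max(weights)


has_substitution = [1, 1, 1, 1, 0, 0, 0, 0]
activities = [6.0, 7.0, 6.0, 7.0, 1.0, 2.0, 1.0, 2.0]
average, low, high = leave_out_weights(has_substitution, activities)
print(average, low, high)
```

A narrow range around the average indicates that the weight does not depend heavily on any single variant.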

To assess the probability that a substitution is associated with an activity by
random chance, the same weight function calculations can be performed when the
sequences and activities are randomly associated (Fig. 5). In this case there should be
no relationship between sequence and function, so weight functions arise only by
chance. A measure of the confidence for the weight function can then be calculated.
It is related to the number of standard deviations by which the value calculated when
sequences and activities are correctly associated exceeds the value calculated when
they are randomly associated. The above methods on model assessment, model
inference and averaging are discussed in detail by Hastie et al., 2001, Springer
Verlag, series in statistics.
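
The confidence measure can be sketched as a permutation calculation. The sketch below departs from random association in one labeled respect: because the hypothetical data set is tiny, every possible reassignment of the substitution among the variants is enumerated, which gives an exact reference distribution. For realistic data sets, random shuffles would be drawn instead. The weight estimate (difference of mean activities) is likewise an assumption made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def substitution_weight(flags, activities):
    """Mean activity with the substitution minus mean activity without it."""
    with_sub = [a for flag, a in zip(flags, activities) if flag]
    without = [a for flag, a in zip(flags, activities) if not flag]
    return mean(with_sub) - mean(without)


def confidence_score(has_substitution, activities):
    """Standard deviations by which the observed weight exceeds chance."""
    observed = substitution_weight(has_substitution, activities)
    n, k = len(activities), sum(has_substitution)
    shuffled = []
    for chosen in combinations(range(n), k):
        flags = [1 if i in chosen else 0 for i in range(n)]
        shuffled.append(substitution_weight(flags, activities))
    return (observed - mean(shuffled)) / pstdev(shuffled)


has_substitution = [1, 1, 1, 1, 0, 0, 0, 0]
activities = [6.0, 7.0, 6.0, 7.0, 1.0, 2.0, 1.0, 2.0]
score = confidence_score(has_substitution, activities)
print(score)
```
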

Step 312. In step 312 new antibody sequences that are predicted to possess
one or more desired properties are derived. Alternatively it can be desirable to rank
order the input variables for detailed sequence-activity correlation measures. The
model can be used to propose sequences that have high probabilities of being
improved. Such sequences can then be synthesized and tested. In one embodiment,
this can be achieved if the effects of various sequence features of the antibodies on
their functions are known based on the modeling. Alternatively, for methods like
neural networks, 10^ or 10** or lO' or lO'^ or lO'* or lO" or as many as 10*° sequences
can be evaluated in silico. Then those predicted by the model to possess one or more
desired properties are selected.
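
In silico evaluation can be sketched by enumerating every combination of candidate substitutions, scoring each sequence with a model, and keeping the best. The additive scoring model, the weights, and the names below are hypothetical stand-ins for whatever model has been derived in the preceding steps.

```python
from itertools import product


def enumerate_and_rank(parent, options, weights, top=3):
    """Score all combinations of substitutions and return the best `top`."""
    positions = sorted(options)
    scored = []
    for choice in product(*(options[p] for p in positions)):
        residues = list(parent)
        for position, residue in zip(positions, choice):
            residues[position - 1] = residue
        # Additive model: residues without an entry contribute nothing.
        score = sum(weights.get((p, r), 0.0) for p, r in zip(positions, choice))
        scored.append(("".join(residues), score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


options = {2: "GK", 3: "WY"}                   # residues allowed at positions 2 and 3
weights = {(2, "K"): 1.5, (3, "Y"): -0.5}      # hypothetical model contributions
print(enumerate_and_rank("AGWRY", options, weights, top=2))
```

For larger sequence spaces exhaustive enumeration becomes impractical, and the candidate sequences would have to be sampled or searched instead.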

Step 314. The statistical quality of the model fit to the input data is evaluated
in step 314. Validation of the sequence-activity correlation can be internal, using
cross-validation of the data, or preferably external, by forecasting the functional
perturbation of a set of new sequences derived from the model. Sequences with
predicted values of their functional perturbations are then physically made and tested
in the same experimental system used to quantify the training set. If the
sequence-activity relationship of the dataset is satisfactorily quantified using internal and external
validation, the model can be applied to a) predict the functional value of other related
sequences not present in the training set, and b) design new sequences within the
described space that are likely to have a function value that is outside or within the
range of function given by the training set.

The initial set of data can be small, so models built from it can be inaccurate.
Initial models may not contain terms to account for amino acid interactions. Others
have found that amino acid changes within an antibody are approximately additive
and few interaction terms are required to describe the effects of mutations on protein
function. See, for example, Aita et al. (2000) Biopolymers 54: 64-79; Aita et al.
(2001) Protein Eng 14: 633-8; Choulier et al. (2002) Protein Eng 15: 373-82; and
Prusis et al. (2002) Protein Eng 15: 305-11. However, such interactions can be
important and can result in a variant that incorporates all beneficial changes having
low activity (Aita et al. (2002) Antibodies 64: 95-105). Improving the modeled
relationship further depends upon obtaining better values for weights whose
confidence scores are low. To obtain this data, additional variants can be designed as
described in Section 5.4 below; these will provide additional data useful in establishing
more precise sequence-activity relationships.

The output from each method for modeling a sequence-activity relationship
can be one or more of: (i) a regression coefficient, weight or other value describing
the relative or absolute contribution of each substitution or combination of
substitutions to one or more activities of the antibody, (ii) a standard deviation,
variance or other measure of the confidence with which the value describing the
contribution of the substitution or combination of substitutions to one or more activities
of the antibody can be assigned, (iii) a rank order of preferred substitutions, (iv) the
additive and non-additive components of each substitution or combination of
substitutions, (v) a mathematical model that can be used for analysis and prediction of
the functions of in silico generated sequences, (vi) a modification of one or more
inputs or weights used by an expert system 100 to select substitutions, or (vii) a
modification of the methods used by expert system 100 to design an antibody variant
set.



5.3.1 Methods for Combining the Results from Two or More
Sequence-activity Relationship Modeling Methods

It will be appreciated by one skilled in the art that each different method for
deriving relationships between antibody sequences and activities can differ in the
precise values of its outputs. In some embodiments of the invention it is therefore
desirable to combine the outputs from two or more such methods for subsequent uses.
This corresponds to step 06 in Fig. 2. There are a variety of ways in which such
outputs can be combined. In some embodiments, each output can be independently
applied to the subsequent design of antibody variants (Fig. 2, step 07) or the
modification of parameters or weights used by expert system 100 for the selection of
substitutions (Fig. 2, step 02) or the design of antibody variant sets (Fig. 2, step 03). In
some embodiments, average values (or some other mathematical function of two or
more values derived by two or more sequence-activity models) can be calculated for
the regression coefficient, weight or other value describing the relative or absolute
contribution of each substitution or combination of substitutions to one or more
activities of the antibody (e.g., as defined in Equation 4 below). In some embodiments,
the standard deviation, variance or other measure of the confidence with which the
value describing the contribution of the substitution or combination of substitutions to
one or more activities of the antibody can be assigned is combined (e.g., as defined in
Equation 4 below). In some embodiments, the rank order of preferred substitutions is used to
combine the methods. In some embodiments, the additive (linear variables) and
non-additive components (non-linear variables) of each substitution or combination of
substitutions are combined:

(Eq. 6)    V_ix = f(M_1(ix), M_2(ix), ..., M_j(ix))

where,

V_ix is a combined measure of one of the descriptors measuring the
performance of an antibody in which monomer x is substituted at position i;

M_j(ix) is a measure of one of the descriptors measuring the performance of
an antibody in which monomer x is substituted at position i, determined by
sequence-activity correlating method j (M_j(ix) is the contribution of ix as
determined by Model j);

f() is some mathematical function.

In one embodiment f() can be a linear combination of the contributions of ix from
many models.
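
The linear combination embodiment of Equation 6 can be sketched as follows. The substitution labels, the per-model contributions, and the model weights are hypothetical; with no model weights supplied, the sketch falls back to a plain average.

```python
def combine_contributions(model_outputs, model_weights=None):
    """Linear combination of per-substitution contributions from several models."""
    n_models = len(model_outputs)
    if model_weights is None:
        model_weights = [1.0 / n_models] * n_models   # plain average
    combined = {}
    for output, weight in zip(model_outputs, model_weights):
        for substitution, contribution in output.items():
            combined[substitution] = (combined.get(substitution, 0.0)
                                      + weight * contribution)
    return combined


# Hypothetical contributions M_j(ix) from two modeling methods.
model_1 = {"K2": 1.0, "Y3": -0.5}
model_2 = {"K2": 2.0, "Y3": 0.5}

print(combine_contributions([model_1, model_2]))
print(combine_contributions([model_1, model_2], model_weights=[0.25, 0.75]))
```
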

In some embodiments, independent values can be obtained for the functional
values of in silico generated sequences derived from two or more mathematical
models by using the model generated in the prior steps to predict/calculate the value
of the new sequence represented in terms of the same variables that are used to build
the model. In some embodiments, average values (or some other mathematical
function of two or more values derived by two or more sequence-activity models) can
be obtained for the functional values of in silico generated sequences derived from
two or more mathematical models.

The methods used to derive sequence-activity relationships can be chosen or 
modified such that they better predict the performance of individual substitutions 
within a combination of other substitutions in an antibody, as described in more detail 
in Subsection 5.4.4.

5.4 USE OF SEQUENCE-ACTIVITY RELATIONSHIPS TO DESIGN 
OPTIMIZED VARIANTS OR ADDITIONAL VARIANT SETS 

There are many ways to use the results of sequence-activity correlations
described in Section 5.3 in the design of a subsequent set of variants. This
corresponds to step 07 of Fig. 2. Conceptually, this step is similar to the processes
corresponding to steps 02 and 03 in Fig. 2. It involves defining a sequence space in
terms of an antibody sequence and a set of substitutions, then designing a set of
antibody variants that incorporate those substitutions in different combinations.

5.4.1 Definition of the Sequence Space to Represent Additional Variant 
Sets 

A few methods for defining a sequence space for an optimized variant or
additional variant set, using an antibody sequence and a set of substitutions, are
enumerated here by way of examples not intended to limit the scope of the present
invention.

In one embodiment the sequence space can be defined in terms of the original
target antibody sequence and substitutions that have already been tested. In preferred
embodiments of the invention, this method for defining the sequence space is used if
the desired degree of further increase in one or more activities of the antibody is less
than 10-fold, preferably less than 5-fold, more preferably less than 2-fold.

In another embodiment, die sequence space can be defined in terms of tiie 
original target antibody sequence and a combination of substitutions tiiat have already 
been tested and tiiose tiiat have not yet been tested. In preferred embodiments of tiie 
invention, tiiis method for defining tiie sequence space is used if tiie desired degree of 
fiirtiier increase in one or more activity of the antibody is greater tiian 2.fold, 
preferably greater than 5-fold, and more preferably greater tiian lO-fold. 

In still anotiier embodiment, tiie sequence space can be defined purely in terms 
of tiie original target antibody sequence and substitutions tiiat have not yet been 
tested. This metiiod for defining tiie sequence space is generally most appropriate for 
tiie initial variant set as represented in Fig. 2 step 02. 

5.4.2 Assessment of Previously Tested Substitutions for Incorporation into
Optimized Variants or Additional Variant Sets

The methods for selecting substitutions that have not previously been tested
have been described in Section 5.1. Methods for selecting or eliminating substitutions
that have previously been tested use one or more of the outputs from the methods for
correlating antibody sequences with their activities. A few methods for defining a
sequence space for an optimized variant or additional variant set using an antibody
sequence and a set of substitutions are enumerated here by way of examples. In the
following examples, the term "substitution" can also mean a pair or larger group of
substitutions (for example, when the descriptors of antibodies are represented in
non-linear form as described in Section 5.3), since sequence-activity relationships can
produce regression coefficients, weights or other measurements of contribution to
function and confidences for these measurements that apply not to individual
substitutions but to specific combinations of these substitutions.

(i) A substitution can be selected if it has a positive regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody.

(ii) A substitution can be selected if it has a positive regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it is at least one standard deviation, preferably two
standard deviations or more preferably three standard deviations above neutrality.

(iii) A substitution can be selected if it has a positive regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it has also been tested at least once, preferably at least
twice, more preferably at least 3 times, more preferably at least 4 times, even more
preferably at least 5 times.

(iv) A substitution can be selected if it has a positive regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it is at least one standard deviation, preferably two
standard deviations or more preferably three standard deviations above neutrality, and
it has also been tested at least once, preferably at least twice, more preferably at least
3 times, more preferably at least 4 times, even more preferably at least 5 times.

(v) A substitution can be selected from a rank ordered list of substitutions. For
example, the most favorable substitution may be selected, or the 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 most favorable substitutions can be
selected.

(vi) A substitution can be selected from a rank ordered list of substitutions.
For example, the most favorable substitution can be selected, or the 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 most favorable substitutions can be
selected, and it is at least one standard deviation, preferably two standard deviations
or more preferably three standard deviations above neutrality.

(vii) A substitution can be selected from a rank ordered list of substitutions.
For example, the most favorable substitution can be selected, or the 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 most favorable substitutions can be
selected, and it has also been tested at least once, preferably at least twice, more
preferably at least 3 times, more preferably at least 4 times, even more preferably at
least 5 times.

(viii) A substitution can be selected from a rank ordered list of substitutions.
For example, the most favorable substitution may be selected, or the 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 most favorable substitutions can be
selected, and it is at least one standard deviation, preferably two standard deviations
or more preferably three standard deviations above neutrality, and it has also been
tested at least once, preferably at least twice, more preferably at least 3 times, more
preferably at least 4 times, even more preferably at least 5 times.

(ix) A substitution can be selected if it has a negative regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it is less than three standard deviations, preferably less
than two standard deviations or more preferably less than one standard deviation
below neutrality.

(x) A substitution can be selected if it has a negative regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it has also been tested no more than 5 times, preferably
no more than 4 times, more preferably no more than 3 times, more preferably no more
than twice, even more preferably no more than once.

(xi) A substitution can be selected if it has a negative regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it is less than three standard deviations, preferably less
than two standard deviations or more preferably less than one standard deviation
below neutrality, and it has also been tested no more than 5 times, preferably no more
than 4 times, more preferably no more than 3 times, more preferably no more than
twice, even more preferably no more than once.

(xii) A substitution can be eliminated if it has a negative regression
coefficient, weight or other value describing its relative or absolute contribution to
one or more activity of the antibody.

(xiii) A substitution can be eliminated if it has a negative regression
coefficient, weight or other value describing its relative or absolute contribution to
one or more activity of the antibody, and it is at least one standard deviation,
preferably two standard deviations or more preferably three standard deviations above
neutrality.

(xiv) A substitution can be eliminated if it has a negative regression
coefficient, weight or other value describing its relative or absolute contribution to
one or more activity of the antibody, and it has also been tested at least once,
preferably at least twice, more preferably at least 3 times, more preferably at least 4
times, even more preferably at least 5 times.

(xv) A substitution can be eliminated if it has a negative regression coefficient,
weight or other value describing its relative or absolute contribution to one or more
activity of the antibody, and it is at least one standard deviation, preferably two
standard deviations or more preferably three standard deviations above neutrality, and
it has also been tested at least once, preferably at least twice, more preferably at least
3 times, more preferably at least 4 times, even more preferably at least 5 times.
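
Several of the criteria above share the same ingredients: the sign of the weight, its distance from neutrality in standard deviations, the number of times the substitution has been tested, and its position in a rank ordered list. The sketch below shows one way these could be combined. It assumes that neutrality corresponds to a weight of zero and, for elimination, that the distance from neutrality is measured as an absolute value; the substitution names, weights and counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubstitutionStat:
    name: str          # hypothetical label, e.g. "A23V"
    weight: float      # regression coefficient or other contribution value
    std_dev: float     # confidence measure for the weight
    times_tested: int  # number of tested variants containing the substitution

def select(stats, min_sd=0.0, min_tests=0, top_n=None):
    """Criteria (i)-(viii): positive weight, optional SD, test-count and rank filters."""
    chosen = [s for s in stats
              if s.weight > 0
              and s.weight >= min_sd * s.std_dev   # distance above neutrality (assumed 0)
              and s.times_tested >= min_tests]
    chosen.sort(key=lambda s: s.weight, reverse=True)  # rank ordered, most favorable first
    return chosen if top_n is None else chosen[:top_n]

def eliminate(stats, min_sd=0.0, min_tests=0):
    """Criteria (xii)-(xv): negative weight, optional SD and test-count filters."""
    return [s for s in stats
            if s.weight < 0
            and -s.weight >= min_sd * s.std_dev    # absolute distance from neutrality
            and s.times_tested >= min_tests]

stats = [
    SubstitutionStat("A23V", 0.8, 0.2, 4),
    SubstitutionStat("S31T", 0.3, 0.4, 2),
    SubstitutionStat("Y52F", -0.6, 0.2, 5),
    SubstitutionStat("K64R", -0.1, 0.3, 1),
]
confident_positive = select(stats, min_sd=2, min_tests=3)  # in the manner of criterion (iv)
confident_negative = eliminate(stats, min_sd=2)            # in the manner of criterion (xiii)
```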

5.4.3 Methods for Designing Antibody Variant Sets Incorporating
Previously Tested Substitutions

Antibody variants that combine or eliminate previously tested substitutions
can serve at least two purposes. First, they can be used to obtain antibody variants
that are improved for one or more property, activity or function of interest. Generally,
though not exclusively, substitutions selected according to criteria (i) - (viii) in
subsection 5.4.2 are most likely to be appropriate for this purpose. Second, they can
be used to obtain additional information relating the sequence to the activity of an
antibody, thereby improving the accuracy with which predictions can be made
concerning the effect of substitutions upon one or more property, activity or function
of an antibody. Generally, though not exclusively, substitutions selected according to
criteria (i) - (xi) in subsection 5.4.2 are most likely to be appropriate for this purpose.

The following methods can be used to design antibody variants containing
combinations of substitutions selected by one or more of the methods described in
subsection 5.4.2.

Method 1. An antibody that has previously been tested for the one or more
property, activity or function of interest is selected. In preferred embodiments the
selected antibody has one of the 100 highest experimentally measured scores for the
property, activity or function of interest, more preferably one of the 50 highest
experimentally measured scores, even more preferably one of the 25 highest
experimentally measured scores, even more preferably one of the 10 highest
experimentally measured scores.

The substitutions in the selected antibody are combined with one or more
substitutions selected by one or more of the methods described in subsection 5.4.2. In
preferred embodiments less than 10 selected substitutions are used, more preferably
less than 5 selected substitutions are used, even more preferably less than 3 selected
substitutions are used.



Method 2. An antibody that has previously been tested for the one or more
property, activity or function of interest is selected. In preferred embodiments the
selected antibody has one of the 100 highest experimentally measured scores for the
property, activity or function of interest, more preferably one of the 50 highest
experimentally measured scores, even more preferably one of the 25 highest
experimentally measured scores, even more preferably one of the 10 highest
experimentally measured scores.

The substitutions in the selected antibody are combined with one or more
substitutions selected by one or more of the methods described in subsection 5.4.2. In
preferred embodiments less than 10 selected substitutions are used, more preferably
less than 5 selected substitutions are used, even more preferably less than 3 selected
substitutions are used. In addition, these substitutions are combined with one or more
substitutions selected by one or more method described in Section 5.1 (i.e., by the
methods used in step 03 of Fig. 2). In preferred embodiments, less than 10 of these
last selected substitutions are used, more preferably less than 5 of these last selected
substitutions are used, even more preferably less than 3 of these last selected
substitutions are used.

Method 3. Two or more substitutions identified by one or more of the
methods described in subsection 5.4.2 are selected. In preferred embodiments less
than 100 selected substitutions, more preferably less than 50, and even more
preferably less than 25 are used. One or more antibody variants containing these
substitutions are designed using the methods described in Section 5.2.

Method 4. One or more substitutions selected by one or more of the methods
described in subsection 5.4.2 are selected. In preferred embodiments less than 100
selected substitutions, more preferably less than 50, and even more preferably less
than 25 are used. One or more substitutions are selected using one or more of the
methods described in Section 5.1. In preferred embodiments, less than 100, and more
preferably less than 50 of these selected substitutions are used. Then, one or more
antibody variants are designed using the methods described in Section 5.2.

Method 5. One or more substitutions selected by one or more of the methods
described in subsection 5.4.2 that contribute most positively to the property (e.g.,
function, activity of interest) are selected. In preferred embodiments, between 1 and
20 of the most positive substitutions are selected. One or more antibody variants that
have already been tested for the property are selected. In preferred embodiments,
between 1 and 20 of the most active antibodies are selected. One or more of the selected
substitutions is added to each of the one or more selected antibodies. In preferred
embodiments, the number of substitution positions to be added to each antibody
variant sequence is between 1 and 10, more preferably between 1 and 6, and even
more preferably between 1 and 3.
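
Method 5 can be sketched as a small enumeration. This is an illustration under stated assumptions: each variant is represented as a set of hypothetical substitution labels, the activities and weights are invented, and conflicts between two substitutions at the same position are not checked.

```python
from itertools import combinations

def extend_variants(tested, weights, n_variants=2, n_subs=3, max_added=2):
    """Add up to max_added of the most positive substitutions to the most active variants.

    tested  -- dict mapping frozenset of substitution labels to measured activity
    weights -- dict mapping substitution label to its calculated contribution
    """
    top_variants = sorted(tested, key=tested.get, reverse=True)[:n_variants]
    ranked = sorted(weights, key=weights.get, reverse=True)
    top_subs = [s for s in ranked if weights[s] > 0][:n_subs]

    designs = set()
    for variant in top_variants:
        candidates = [s for s in top_subs if s not in variant]
        for k in range(1, max_added + 1):
            for extra in combinations(candidates, k):
                designs.add(variant | frozenset(extra))
    return designs - set(tested)  # drop anything that was already measured

# Hypothetical data: three tested variants and four scored substitutions.
tested = {frozenset({"A23V"}): 2.0, frozenset({"S31T"}): 1.5, frozenset(): 1.0}
weights = {"A23V": 0.8, "S31T": 0.3, "Y52F": -0.6, "G55A": 0.5}
designs = extend_variants(tested, weights)
```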

Method 6. Substitutions whose regression coefficients, weights or other
values describing the relative or absolute contribution to one or more activity of the
antibody are positive are selected. Those substitutions whose regression coefficients,
weights or other values describing the relative or absolute contribution to one or more
activity of the antibody have confidences within a threshold distance from the
randomized average weight for that substitution are eliminated. In preferred
embodiments, this threshold distance is within 1 standard deviation, more preferably
within 2 standard deviations. The substitutions with positive weights and high
confidences are combined into a single variant. Alternatively, the selected
substitutions are used to design a set of antibody variants as described in Section 5.2.
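
One possible reading of Method 6 is sketched below. It assumes that, for each substitution, a mean and standard deviation of the weight obtained from randomized data are available, and that a weight is treated as low confidence when it lies within `k` standard deviations of that randomized mean. All names and numbers are hypothetical.

```python
def confident_positive_subs(weights, random_mean, random_sd, k=1.0):
    """Keep positive-weight substitutions lying more than k SDs from the randomized mean."""
    kept = []
    for sub, w in weights.items():
        if w <= 0:
            continue  # only positive contributions are selected
        if abs(w - random_mean[sub]) <= k * random_sd[sub]:
            continue  # within the threshold distance: eliminated
        kept.append(sub)
    return sorted(kept)

# Hypothetical weights and randomization statistics.
weights = {"A23V": 0.8, "S31T": 0.3, "Y52F": -0.6}
random_mean = {"A23V": 0.1, "S31T": 0.2, "Y52F": 0.0}
random_sd = {"A23V": 0.2, "S31T": 0.2, "Y52F": 0.2}

selected = confident_positive_subs(weights, random_mean, random_sd, k=1.0)
single_variant = frozenset(selected)  # all retained substitutions combined in one variant
```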

Method 7. Substitutions are ranked in the order in which confidences can be
assigned to regression coefficients, weights or other values describing the relative or
absolute contribution to one or more activity of the antibody. The substitutions with
lowest confidence scores are selected. From the sequences of antibody variants
whose activities have already been measured, those that have high values for the
property of interest are selected. In preferred embodiments, between 1 and 20 tested
antibody variant sequences with highest activities are selected. One or more of the
selected substitutions is added to each selected variant. In preferred embodiments, the
number of substitutions to be added to each antibody variant sequence is between 1
and 10, more preferably between 1 and 6, and even more preferably between 1 and 3.

Method 8. One or more antibody variants that have already been tested for the
property of interest are selected. In preferred embodiments, between 1 and 20 most
active antibodies are selected. One or more substitutions for which a contribution to
the property has been calculated are selected. For each of the one or more selected
antibodies, the following process is performed. One of the selected substitutions is
added or removed and the predicted activity of the resultant antibody is calculated
using one or more models for sequence-activity relationship as described in
Section 5.3. Exemplary models include, but are not limited to, (i) regression
techniques that provide regression coefficients for the descriptors, (ii) models that
generate weights or other value describing the relative or absolute contribution of
each substitution or combination of substitutions to one or more activity of the
antibody, (iii) models that provide standard deviation, variance or other measures of
the confidence with which the value describing the contribution of the substitution or
combination of substitutions to one or more activity of the antibody can be assigned,
(iv) models that rank order preferred substitutions, (v) models that provide additive
and non-additive components of each substitution or combination of substitutions, (vi)
analytical mathematical models that can be used for analysis and prediction of the
functions of in silico generated sequences, and (vii) supervised and unsupervised machine
learning techniques such as neural networks that can predict the activity of new antibody
sequences expressed in terms of the descriptors that are used in modeling.

If the predicted activity of the new antibody is greater than the predicted value
of the antibody before the change, the change is incorporated. Otherwise, the process
reverts to the sequence of the antibody before the change. This process continues for
a certain number of steps (preferably more than 10 steps, more preferably more than
100 steps, even more preferably more than 1000 steps) or until the predicted activity
of the antibody converges to a value. Either the final antibody sequence in the series
of iterations of the method, or the antibody sequence in the series with the highest
predicted activity is selected. This process can optionally be performed more than
once starting from each initial antibody sequence.
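
The iterative add-or-remove process described above resembles a simple hill climb over in silico sequences. The sketch below is one such illustration, assuming a variant is represented as a set of hypothetical substitution labels and that the sequence-activity model is a purely additive function; any of the model types listed above could be supplied as `predict` instead.

```python
import random

def hill_climb(start, candidates, predict, steps=1000, seed=0):
    """Toggle one substitution per step and keep the change only if the prediction improves."""
    rng = random.Random(seed)
    current = frozenset(start)
    current_score = predict(current)
    for _ in range(steps):
        sub = rng.choice(candidates)
        proposal = current ^ {sub}       # add the substitution if absent, remove it if present
        score = predict(proposal)
        if score > current_score:        # incorporate the change
            current, current_score = proposal, score
        # otherwise revert, i.e. keep the sequence from before the change
    return current, current_score

# Hypothetical additive model: predicted activity is the sum of substitution weights.
weights = {"A23V": 0.75, "S31T": 0.25, "Y52F": -0.5}

def additive_model(variant):
    return sum(weights[s] for s in variant)

best, best_score = hill_climb({"Y52F"}, list(weights), additive_model, steps=1000)
```

Because only improving changes are accepted, the final sequence in this sketch is also the one with the highest predicted activity in the series. Running the function with different `seed` values corresponds to performing the process more than once from the same initial sequence.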

Method 9. As an optional addition to any of the design methods including 
methods 1, 2, 5, and 7, one or more substitutions determined to be detrimental to the 
desired property (e.g., by any of the criteria described in subsection 5.4.2 including 
criteria (xii) - (xv)) are eliminated. 

Method 10. As an optional addition to any design method, newly designed
variants that can be reached by making a certain number of substitutions to an
antibody sequence whose activity has already been measured are discarded and not
synthesized. In preferred embodiments newly designed variants that can be reached
by making 10 or fewer substitutions to an antibody sequence whose activity has
already been measured are not synthesized. More preferably, newly designed variants
that can be reached by making 5 or fewer substitutions to an antibody sequence whose
activity has already been measured are not synthesized. More preferably, newly
designed variants that can be reached by making 3 or fewer substitutions to an
antibody sequence whose activity has already been measured are not synthesized.
Even more preferably, newly designed variants that can be reached by making 2 or
fewer substitutions to an antibody sequence whose activity has already been measured
are not synthesized. Most preferably, newly designed variants that can be reached by
making 1 substitution to an antibody sequence whose activity has already been measured
are not synthesized.
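
The filter of Method 10 can be illustrated with a substitution count between aligned sequences of equal length (a Hamming distance). This is a minimal sketch; the five-residue sequences are hypothetical placeholders and insertions or deletions are not handled.

```python
def substitution_distance(a, b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(x != y for x, y in zip(a, b))

def novel_designs(designs, tested, max_reach):
    """Discard designs reachable from any tested sequence in max_reach or fewer substitutions."""
    return [d for d in designs
            if all(substitution_distance(d, t) > max_reach for t in tested)]

tested = ["ACDEF"]
designs = ["ACDEY", "AQDEY", "WQDKY"]  # 1, 2 and 4 substitutions away from the tested sequence
to_synthesize = novel_designs(designs, tested, max_reach=2)
```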

One skilled in the art will appreciate that there are many possible ways of 
using sequence-activity information to design improved antibody variants. The 
schemes outlined above are intended to illustrate a few of the design possibilities. 

5.4.4 Methods for Modifying the Choice and Combinations of Methods
used to Determine Sequence-Activity Relationships

The performances of different sequence-activity modeling methods can be
quantitatively compared. Such comparisons can be used to modify variable
parameters within each method, or to select methods of combining the results of two
or more sequence-activity correlating methods as outlined in Subsection 5.3.1.

The outputs of methods that determine sequence-activity relationship are
outlined in Section 5.3. These outputs can be combined to calculate the predicted
activity of an antibody and the confidence with which that activity can be predicted.
These predictions can be compared with activity values obtained experimentally for
newly designed and synthesized antibody variants, and the method or methods of
deriving sequence-activity relationships may be chosen or modified in one or more of
the following ways.

1. The weights applied to the scores produced by the one or more sequence-activity
correlating methods, for example as shown in Equation 4 or as described in
Subsection 5.3.1, can be modified such that one or more of the following are true.

(i) The activity value predicted for the most active newly designed and
synthesized antibody variant most closely matches the experimentally determined
activity for that variant.

(ii) The rank order of activity values predicted for some number of the
most active newly designed and synthesized antibody variants most closely matches the
experimentally determined rank order of activity for those variants. In preferred
embodiments the rank order of activity values predicted for the 5 most active newly
designed and synthesized antibody variants most closely matches the experimentally
determined rank order of activity for those variants, more preferably the rank order of
activity values predicted for the 10 most active newly designed and synthesized
antibody variants most closely matches the experimentally determined rank order of
activity for those variants, even more preferably the rank order of activity values
predicted for the 15 most active newly designed and synthesized antibody variants
most closely matches the experimentally determined rank order of activity for those
variants.

(iii) The fewest newly designed and synthesized antibody variants
predicted to be more active than the initial target antibody possess experimentally
determined activities that are lower than the initial target antibody.

(iv) The fewest newly designed and synthesized antibody variants
predicted to be more active than the most active previously tested antibody possess
experimentally determined activities that are lower than the most active previously
tested antibody.

2. The sequence-activity correlating method is chosen such that one or more
of the following are true.

(i) The activity value predicted for the most active newly designed and
synthesized antibody variant most closely matches the experimentally determined
activity for that variant.

(ii) The rank order of activity values predicted for some number of the
most active newly designed and synthesized antibody variants most closely matches the
experimentally determined rank order of activity for those variants. In preferred
embodiments the rank order of activity values predicted for the 5 most active newly
designed and synthesized antibody variants most closely matches the experimentally
determined rank order of activity for those variants, more preferably the rank order of
activity values predicted for the 10 most active newly designed and synthesized
antibody variants most closely matches the experimentally determined rank order of
activity for those variants, even more preferably the rank order of activity values
predicted for the 15 most active newly designed and synthesized antibody variants
most closely matches the experimentally determined rank order of activity for those
variants.

(iii) The fewest newly designed and synthesized antibody variants
predicted to be more active than the initial target antibody possess experimentally
determined activities that are lower than the initial target antibody.

(iv) The fewest newly designed and synthesized antibody variants
predicted to be more active than the most active previously tested antibody possess
experimentally determined activities that are lower than the most active previously
tested antibody.

3. In some embodiments, the process of steps 1 or 2 can be performed using
regression techniques, machine learning or other multivariate data analysis tools to
calculate or minimize the differences between the values predicted by the
sequence-activity relationship, and those observed experimentally.

4. In some embodiments, the process of steps 1 or 2 can be performed using
values predicted by the sequence-activity relationship, and those observed
experimentally for more than one set of antibodies.

5. In some embodiments the process of step 4 can be performed using two or
more datasets from antibodies that fall into the same class and subclass. For example,
two or more sets of IgG antibodies, two or more sets of IgE antibodies, two or more
sets of single chain antibodies, two or more sets of Fab fragments. Weights for
expert system rules 120 that are modified using two or more datasets from antibodies
of the same class and subclass can be stored, for example in knowledge base 108 or
case-specific data 110. These weights or choices for sequence-activity determining
methods can then be used by expert system 100 when a subsequent target antibody
sequence and activity dataset of that class and subclass is presented.

6. In some embodiments the process of step 4 can be performed using two or
more datasets from antibodies that fall into the same class. For example, two or more
sets of human antibodies, two or more sets of murine antibodies. Weights for expert
system 100 rules 120 that are modified using two or more datasets from antibodies of
the same class can be stored, for example in knowledge base 108 or case-specific data
110. These weights for expert system 100 rules 120 can then be used by expert
system 100 when a subsequent target antibody sequence and activity dataset of that
class and subclass is presented.
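
Choosing a sequence-activity correlating method by rank-order agreement, as in item 2(ii) above, can be sketched as follows. The agreement measure used here (the fraction of concordant pairs among the most active measured variants) is an assumption chosen for simplicity, and the method labels, variant names and values are hypothetical.

```python
from itertools import combinations

def concordant_fraction(predicted, measured, top_n=5):
    """Fraction of pairs among the top_n measured variants that the prediction orders correctly."""
    top = sorted(measured, key=measured.get, reverse=True)[:top_n]
    pairs = list(combinations(top, 2))
    if not pairs:
        return 1.0
    agree = sum((predicted[a] - predicted[b]) * (measured[a] - measured[b]) > 0
                for a, b in pairs)
    return agree / len(pairs)

def choose_method(predictions_by_method, measured, top_n=5):
    """Pick the method whose predicted rank order best matches the experimental one."""
    return max(predictions_by_method,
               key=lambda m: concordant_fraction(predictions_by_method[m], measured, top_n))

# Hypothetical measured activities and predictions from two labeled methods.
measured = {"v1": 3.0, "v2": 2.0, "v3": 1.0}
predictions_by_method = {
    "method_a": {"v1": 2.5, "v2": 2.6, "v3": 0.5},  # swaps v1 and v2
    "method_b": {"v1": 2.9, "v2": 1.8, "v3": 1.2},  # preserves the measured order
}
chosen = choose_method(predictions_by_method, measured)
```

The same scoring function could be applied to candidate weightings of a combined model, as in item 1, by treating each weighting as a separate entry in `predictions_by_method`.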



5.5 USE OF SEQUENCE-ACTIVITY RELATIONSHIPS TO TRAIN AN
EXPERT SYSTEM FOR SUBSTITUTION IDENTIFICATION

The endpoint of a process of antibody optimization is reached when one or
more antibodies are obtained with one or more properties at the levels defined by a
user, these activity levels being appropriate to allow the use of the antibody in
performing a specific task. This corresponds to Fig. 2 step 08.

In addition to designing improved antibody variants, information from
sequence-activity relationships can be used to provide information to improve the
initial selection of substitutions, for example by modifying the weights applied to the
scores produced by the expert system 100 as described in Section 5.1. As an example,
the weights can be modified according to the following process.

1. As described in Section 5.3, the sequence-activity relationship can be used
to calculate (i) a regression coefficient, weight or other value describing the relative or
absolute contribution of each substitution or combination of substitutions to one or
more activity of the antibody, (ii) a standard deviation, variance or other measure of
the confidence with which the value describing the contribution of the substitution or
combination of substitutions to one or more activity of the antibody can be assigned,
and/or (iii) a rank order of preferred substitutions.

2. The results of applying two or more rules 120 of expert system 100 are
combined and can be used to obtain (i) a score describing the predicted effect of a
substitution upon one or more antibody property, (ii) a probability or confidence
describing the predicted effect of a substitution upon one or more antibody property,
activity or function, or (iii) a predicted rank order of preferred substitutions. Different
values for each of these predictions can result from modifications of the weights
applied to the scores produced by expert system 100 as described in Section 5.1, for
example as shown in equations (1) or (2).

3. The weights applied to the scores produced by expert system 100 can be 
modified such that one or more of the following are true. 

(i) The regression coefficient, weight or other value describing the
relative or absolute contribution of each substitution or combination of substitutions
to one or more activity of the antibody that is derived from the sequence-activity
relationship more closely corresponds with the score describing the predicted effect of
a substitution upon one or more antibody property, activity or function that is derived
from expert system 100.

(ii) The standard deviation, variance or other measure of the
confidence with which the value describing the contribution of the substitution or the
combination of substitutions to one or more activity of the antibody can be assigned
that is derived from the sequence-activity relationship more closely corresponds with
the probability or confidence describing the predicted effect of a substitution upon one
or more antibody property, activity or function that is derived from expert system 100.

(iii) The rank order of preferred substitutions that is derived from the 
sequence-activity relationship more closely corresponds with the predicted rank order 
of preferred substitutions that is derived from expert system 100. 

4. In some embodiments, the process of steps 1 to 3 can be performed using
regression techniques, machine learning or other multivariate data analysis tools to
minimize the differences between the values obtained from the sequence-activity
relationship, and those predicted by expert system 100.

5. In some embodiments, the process of steps 1 to 3 can be performed using
expert system 100 predictions and sequence-activity relationships for more than one
set of antibodies.

6. In some embodiments the process of step 5 can be performed using two or
more datasets from antibodies that fall into the same class and subclass. For example,
two or more sets of IgG antibodies, two or more sets of IgE
antibodies, two or more sets of single chain antibodies, two or more sets of Fab
fragments. Weights for expert system 100 rules 120 that are modified using two or
more datasets from antibodies of the same class and subclass can be stored, for
example in knowledge base 108 or case-specific data 110. These weights for expert
system rules 120 can then be used by expert system 100 when a subsequent target
antibody of that class and subclass is presented.

7. In some embodiments the process of step 5 can be performed using two or
more datasets from antibodies that fall into the same class. For example, two or more
sets of human antibodies, two or more sets of murine antibodies. Weights for expert
system 100 rules 120 that are modified using two or more datasets from antibodies of
the same class can be stored, for example in knowledge base 108 or case-specific data
110. These weights for rules 120 can then be used by expert system 100 when a
subsequent target antibody of that class is presented.
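
The weight modification of steps 3 and 4 can be illustrated as a least-squares fit. The sketch assumes that the expert system score for a substitution is a weighted sum of per-rule scores (equations (1) and (2) of Section 5.1 are not reproduced here, so this form is an assumption) and uses plain gradient descent as a stand-in for any regression technique. The rule scores and target coefficients are hypothetical.

```python
def fit_rule_weights(rule_scores, targets, lr=0.05, epochs=2000):
    """Adjust rule weights so weighted rule scores approach the sequence-activity coefficients.

    rule_scores -- dict: substitution -> list of scores, one per expert system rule
    targets     -- dict: substitution -> coefficient from the sequence-activity relationship
    """
    n_rules = len(next(iter(rule_scores.values())))
    weights = [1.0] * n_rules          # start from equal weights
    subs = list(targets)
    for _ in range(epochs):
        grad = [0.0] * n_rules
        for s in subs:
            predicted = sum(w * x for w, x in zip(weights, rule_scores[s]))
            err = predicted - targets[s]
            for r in range(n_rules):
                grad[r] += 2.0 * err * rule_scores[s][r] / len(subs)
        # gradient step on the mean squared difference
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

# Hypothetical scores from two rules and coefficients observed for three substitutions.
rule_scores = {"A23V": [1.0, 0.0], "S31T": [0.0, 1.0], "G55A": [1.0, 1.0]}
targets = {"A23V": 0.5, "S31T": 2.0, "G55A": 2.5}
fitted = fit_rule_weights(rule_scores, targets)
```

In this invented example the second rule ends up with roughly four times the weight of the first, so its selection method would be preferred in a later iteration.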

By using a formal system for substitution selection, predictions made by
expert system 100 can be improved so that preferences (e.g., higher weights) are given
to selection methods 130 that have performed well in previous iterations.

Different algorithms and methods for identifying productive substitutions and
for deriving sequence-activity relationships may be better suited to different types of
antibody, including different animal origins, different antibody fragments, and
optimization compared with humanization.

By using feedback loops of this nature, where quantitative scoring or ranking
protocols are developed, a learning, automated computational system for antibody
optimization can be developed. This system could include generic information
applicable to all antibody classes and specific information applicable to a more
limited subset of antibodies.

Such a computational system could be made available directly, via the internet
and/or on a subscription basis.

5.6 UTILITY OF THE VARIANTS OF THIS INVENTION

Other useful products produced by the method of the invention include
antibodies incorporating substitutions identified through constructing and
characterizing sets of variant antibodies. Where the antibody is encoded by a
polynucleotide this also includes vectors (including expression vectors) comprising
such polynucleotides, host cells comprising such polynucleotides and/or vectors, and
libraries of antibodies, and libraries of host cells comprising and/or expressing such
libraries of antibodies.

The antibodies developed using the methods of the invention can be used
alone or in combination with other prophylactic or therapeutic agents for treating,
managing, preventing or ameliorating a disorder or one or more symptoms thereof.

The present invention provides methods for preventing, managing, treating, or
ameliorating a disorder comprising administering to a subject in need thereof one or
more antibodies of the invention alone or in combination with one or more therapies
(e.g., one or more prophylactic or therapeutic agents) other than an antibody of the
invention. The present invention also provides compositions comprising one or more
antibodies of the invention and one or more prophylactic or therapeutic agents other
than antibodies of the invention and methods of preventing, managing, treating, or
ameliorating a disorder or one or more symptoms thereof utilizing said compositions.
Therapeutic or prophylactic agents include, but are not limited to, small molecules,
synthetic drugs, peptides, polypeptides, proteins, nucleic acids (e.g., DNA and RNA
nucleotides including, but not limited to, antisense nucleotide sequences, triple
helices, RNAi, and nucleotide sequences encoding biologically active proteins,
polypeptides or peptides), antibodies, synthetic or natural inorganic molecules,
mimetic agents, and synthetic or natural organic molecules.

Any therapy that is known to be useful, or that has been used or is currently
being used for the prevention, management, treatment, or amelioration of a disorder
or one or more symptoms thereof can be used in combination with an antibody of the
invention in accordance with the invention described herein. See, e.g., Gilman et al.,
Goodman and Gilman's: The Pharmacological Basis of Therapeutics, 10th ed.,
McGraw-Hill, New York, 2001; The Merck Manual of Diagnosis and Therapy,
Berkow, M.D. et al. (eds.), 17th Ed., Merck Sharp & Dohme Research Laboratories,
Rahway, NJ, 1999; Cecil Textbook of Medicine, 20th Ed., Bennett and Plum (eds.),
W.B. Saunders, Philadelphia, 1996 for information regarding therapies (e.g.,
prophylactic or therapeutic agents) that have been or are currently being used for
preventing, treating, managing, or ameliorating a disorder or one or more symptoms
thereof. Examples of such agents include, but are not limited to, immunomodulatory
agents, anti-inflammatory agents (e.g., adrenocorticoids, corticosteroids,
beclomethasone, budesonide, flunisolide, fluticasone, triamcinolone,
methylprednisolone, prednisolone, prednisone, hydrocortisone), glucocorticoids,
steroids, non-steroidal anti-inflammatory drugs (e.g., aspirin, ibuprofen, diclofenac,
and COX-2 inhibitors), pain relievers, leukotriene antagonists (e.g., montelukast,
methyl xanthines, zafirlukast, and zileuton), beta2-agonists (e.g., albuterol, bitolterol,
fenoterol, isoetharine, metaproterenol, pirbuterol, salbutamol, terbutaline, formoterol,
salmeterol, and salbutamol terbutaline), anticholinergic agents (e.g., ipratropium
bromide and oxitropium bromide), sulphasalazine, penicillamine, dapsone,
antihistamines, anti-malarial agents (e.g., hydroxychloroquine), anti-viral agents, and
antibiotics (e.g., dactinomycin (formerly actinomycin), bleomycin, erythromycin,
penicillin, mithramycin, and anthramycin (AMC)).

The antibodies of the invention can be used directly against a particular
antigen. In some embodiments, antibodies of the invention belong to a subclass or
isotype that is capable of mediating the lysis of cells to which the antibody binds. In a
specific embodiment, the antibodies of the invention belong to a subclass or isotype
that, upon complexing with cell surface proteins, activates serum complement and/or
mediates antibody dependent cellular cytotoxicity (ADCC) by activating effector cells
such as natural killer cells or macrophages.

The biological activities of antibodies are known to be determined, to a large
extent, by the constant domains or Fc region of the antibody molecule (Unanue and
Benacerraf, Textbook of Immunology, 2nd Edition, Williams & Wilkins, p. 218
(1984)). This includes their ability to activate complement and to mediate
antibody-dependent cellular cytotoxicity (ADCC) as effected by leukocytes. Antibodies of
different classes and subclasses differ in this respect, as do antibodies from the same
subclass but different species; according to the present invention, antibodies of those
classes having the desired biological activity are prepared.

In general, mouse antibodies of the IgG2a and IgG3 subclass and occasionally
IgG1 can mediate ADCC, and antibodies of the IgG3, IgG2a, and IgM subclasses
bind and activate serum complement. Complement activation generally requires the
binding of at least two IgG molecules in close proximity on the target cell. However,
the binding of only one IgM molecule activates serum complement.

The ability of any particular antibody to mediate lysis of the target cell by
complement activation and/or ADCC can be assayed. The cells of interest are grown
and labeled in vitro; the antibody is added to the cell culture in combination with
either serum complement or immune cells which may be activated by the
antigen-antibody complexes. Cytolysis of the target cells is detected by the release of label
from the lysed cells. In fact, antibodies can be screened using the patient's own serum
as a source of complement and/or immune cells. The antibody that is capable of
activating complement or mediating ADCC in the in vitro test can then be used
therapeutically in that particular patient.
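
The label-release readout described above is often summarized as a percent specific lysis. The sketch below uses the conventional normalization (experimental release minus spontaneous release, divided by maximum release minus spontaneous release). That formula, the function name and the control wells it presumes are assumptions for illustration; the specification does not prescribe a particular calculation.

```python
def percent_specific_lysis(experimental, spontaneous, maximum):
    """Percent specific lysis from label-release counts.

    experimental: label released from target cells incubated with antibody
                  plus complement or effector cells.
    spontaneous:  label released from target cells alone (assumed control).
    maximum:      label released after complete lysis, e.g. by detergent
                  (assumed control).
    """
    if maximum <= spontaneous:
        raise ValueError("maximum release must exceed spontaneous release")
    return 100.0 * (experimental - spontaneous) / (maximum - spontaneous)
```

With counts of 600 (experimental), 200 (spontaneous) and 1000 (maximum), the function returns 50.0. Candidate antibodies could be ranked on this value when screening against a particular patient's serum or cells.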

Use of IgM antibodies may be preferred for certain applications; however, IgG
molecules, being smaller, may be better able than IgM molecules to localize to
certain types of infected cells.

In some embodiments, the antibodies of this invention are useful in passively
immunizing patients.

The antibodies of the invention can also be used in diagnostic assays either in
vivo or in vitro for detection/identification of the expression of an antigen in a subject
or a biological sample (e.g., cells or tissues). Non-limiting examples of using an
antibody, a fragment thereof, or a composition comprising an antibody or a fragment
thereof in a diagnostic assay are given in U.S. Patent Nos. 6,392,020; 6,156,498;
6,136,526; 6,048,528; 6,015,555; 5,833,988; 5,811,310; 5,652,114; 5,604,126;
5,484,704; 5,346,687; 5,318,892; 5,273,743; 5,182,107; 5,122,447; 5,080,883;
5,057,313; 4,910,133; 4,816,402; 4,742,000; 4,724,213; 4,724,212; 4,624,846;
4,623,627; 4,618,486; 4,176,174 (all of which are incorporated herein by reference).
Suitable diagnostic assays for the antigen and its antibodies depend on the particular
antibody used. Non-limiting examples are an ELISA, sandwich assay, and steric
inhibition assays. For in vivo diagnostic assays using the antibodies of the invention,
the antibodies may be conjugated to a label that can be detected by imaging
techniques, such as X-ray, computed tomography (CT), ultrasound, or magnetic
resonance imaging (MRI). The antibodies of the invention can also be used for the
affinity purification of the antigen from recombinant cell culture or natural sources.

5.7 DEFINITIONS

It is to be understood that this invention is not limited to the particular 
methodology, devices, solutions or apparatuses described, as such methods, devices, 
solutions or apparatuses can, of course, vary. It is also to be understood that the 
terminology used herein is for the purpose of describing particular embodiments only, 
and is not intended to limit the scope of the present invention. 

Unless defined otherwise herein, all technical and scientific terms used herein
have the same meaning as commonly understood by one of ordinary skill in the art to
which this invention belongs. Singleton et al., Dictionary Of Microbiology And
Molecular Biology, 2nd ed., John Wiley and Sons, New York (1994), and Hale &
Marham, The Harper Collins Dictionary Of Biology, Harper Perennial, NY (1991),
provide one of skill with a general dictionary of many of the terms used in this
invention. Bioinformatic terms referring to expert systems are used in the same sense
that they appear in Jackson, Introduction To Expert Systems, 3rd ed., Addison-Wesley,
NY (1999). Although any methods and materials similar or equivalent to those
described herein can be used in the practice or testing of the present invention, the
preferred methods and materials are described. Unless otherwise indicated, nucleic
acids are written left to right in 5' to 3' orientation; amino acid sequences are written
left to right in amino to carboxy orientation, respectively. The headings provided
herein are not limitations on the invention, but exemplify the various aspects of the
invention. Accordingly, the terms defined immediately below are more fully defined
by reference to the specification as a whole.

The terms "polynucleotide," "oligonucleotide," "nucleic acid," "nucleic
acid molecule" and "gene" are used interchangeably herein to refer to a polymeric
form of nucleotides of any length, and may comprise ribonucleotides,
deoxyribonucleotides, analogs thereof, or mixtures thereof. This term refers only to
the primary structure of the molecule. Thus, the term includes triple-, double- and
single-stranded deoxyribonucleic acid ("DNA"), as well as triple-, double- and
single-stranded ribonucleic acid ("RNA"). It also includes modified, for example by
alkylation and/or by capping, and unmodified forms of the polynucleotide. More
particularly, the terms "polynucleotide," "oligonucleotide," "nucleic acid" and
"nucleic acid molecule" include polydeoxyribonucleotides (containing
2-deoxy-D-ribose), polyribonucleotides (containing D-ribose), including tRNA, rRNA, hRNA,
siRNA and mRNA, whether spliced or unspliced, any other type of polynucleotide
which is an N- or C-glycoside of a purine or pyrimidine base, and other polymers
containing nonnucleotidic backbones, for example, polyamide (e.g., peptide nucleic
acids ("PNAs")) and polymorpholino (commercially available from the Anti-Virals,
Inc., Corvallis, Oreg., as Neugene) polymers, and other synthetic sequence-specific
nucleic acid polymers providing that the polymers contain nucleobases in a
configuration which allows for base pairing and base stacking, such as is found in
DNA and RNA. There is no intended distinction in length between the terms
"polynucleotide," "oligonucleotide," "nucleic acid" and "nucleic acid molecule," and
these terms are used interchangeably herein. These terms refer only to the primary
structure of the molecule. Thus, these terms include, for example, 3'-deoxy-2',5'-DNA,
oligodeoxyribonucleotide N3' P5' phosphoramidates, 2'-O-alkyl-substituted
RNA, double- and single-stranded DNA, as well as double- and single-stranded RNA,
and hybrids thereof including for example hybrids between DNA and RNA or
between PNAs and DNA or RNA, and also include known types of modifications, for
example, labels, alkylation, "caps," substitution of one or more of the nucleotides with
an analog, internucleotide modifications such as, for example, those with uncharged
linkages (e.g., methyl phosphonates, phosphotriesters, phosphoramidates, carbamates,
etc.), with negatively charged linkages (e.g., phosphorothioates, phosphorodithioates,
etc.), and with positively charged linkages (e.g., aminoalkylphosphoramidates,
aminoalkylphosphotriesters), those containing pendant moieties, such as, for example,
proteins (including enzymes (e.g. nucleases), toxins, antibodies, signal peptides,
poly-L-lysine, etc.), those with intercalators (e.g., acridine, psoralen, etc.), those containing
chelates (of, e.g., metals, radioactive metals, boron, oxidative metals, etc.), those
containing alkylators, those with modified linkages (e.g., alpha anomeric nucleic
acids, etc.), as well as unmodified forms of the polynucleotide or oligonucleotide.

Where the polynucleotides are to be used to express encoded proteins,
nucleotides which can perform that function or which can be modified (e.g., reverse
transcribed) to perform that function are used. Where the polynucleotides are to be
used in a scheme which requires that a complementary strand be formed to a given
polynucleotide, nucleotides are used which permit such formation.

It will be appreciated that, as used herein, the terms "nucleoside" and
"nucleotide" will include those moieties which contain not only the known purine and
pyrimidine bases, but also other heterocyclic bases which have been modified. Such
modifications include methylated purines or pyrimidines, acylated purines or
pyrimidines, or other heterocycles. Modified nucleosides or nucleotides can also
include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl
groups are replaced with halogen, aliphatic groups, or are functionalized as ethers,
amines, or the like. The term "nucleotidic unit" is intended to encompass nucleosides
and nucleotides.

Furthermore, modifications to nucleotidic units include rearranging,
appending, substituting for or otherwise altering functional groups on the purine or
pyrimidine base which form hydrogen bonds to a respective complementary
pyrimidine or purine. The resultant modified nucleotidic unit optionally may form a
base pair with other such modified nucleotidic units but not with A, T, C, G or U.
Abasic sites may be incorporated which do not prevent the function of the
polynucleotide. Some or all of the residues in the polynucleotide can optionally be
modified in one or more ways.

Standard A-T and G-C base pairs form under conditions which allow the
formation of hydrogen bonds between the N3-H and C4-oxy of thymidine and the N1
and C6-NH2, respectively, of adenosine and between the C2-oxy, N3 and C4-NH2,
of cytidine and the C2-NH2, N1-H and C6-oxy, respectively, of guanosine. Thus, for
example, guanosine (2-amino-6-oxy-9-β-D-ribofuranosyl-purine) may be
modified to form isoguanosine (2-oxy-6-amino-9-β-D-ribofuranosyl-purine).
Such modification results in a nucleoside base which will no longer effectively form a
standard base pair with cytosine. However, modification of cytosine
(1-β-D-ribofuranosyl-2-oxy-4-amino-pyrimidine) to form isocytosine
(1-β-D-ribofuranosyl-2-amino-4-oxy-pyrimidine) results in a modified nucleotide which
will not effectively base pair with guanosine but will form a base pair with
isoguanosine (U.S. Pat. No. 5,681,702 to Collins et al.). Isocytosine is available from
Sigma Chemical Co. (St. Louis, Mo.); isocytidine may be prepared by the method
described by Switzer et al. (1993) Biochemistry 32:10489-10496 and references cited
therein; 2'-deoxy-5-methyl-isocytidine may be prepared by the method of Tor et al.
(1993) J. Am. Chem. Soc. 115:4461-4467 and references cited therein; and
isoguanine nucleotides may be prepared using the method described by Switzer et al.
(1993), supra, and Mantsch et al. (1993) Biochem. 14:5593-5601, or by the method
described in U.S. Pat. No. 5,780,610 to Collins et al. Other nonnatural base pairs may
be synthesized by the method described in Piccirilli et al. (1990) Nature 343:33-37 for
the synthesis of 2,6-diaminopyrimidine and its complement
(1-methylpyrazolo[4,3]pyrimidine-5,7-(4H,6H)-dione). Other such modified nucleotidic units which
form unique base pairs are known, such as those described in Leach et al. (1992) J.
Am. Chem. Soc. 114:3675-3683 and Switzer et al., supra.

The phrase "DNA sequence" refers to a contiguous nucleic acid sequence. The
sequence can be either single stranded or double stranded, DNA or RNA, but double
stranded DNA sequences are preferable. The sequence can range from an
oligonucleotide of 6 to 20 nucleotides in length to a full length genomic sequence of
thousands of base pairs.

The term "protein" refers to contiguous "amino acids" or amino acid
"residues." Typically, proteins have a function. However, for purposes of this
invention, the term protein also encompasses polypeptides and smaller contiguous amino acid
sequences that do not have a functional activity. "Polypeptide" and "protein" are used
interchangeably herein and include a molecular chain of amino acids linked through
peptide bonds. The terms do not refer to a specific length of the product. Thus,
"peptides," "oligopeptides," and "proteins" are included within the definition of
polypeptide. The terms include polypeptides containing co- and/or
post-translational modifications of the polypeptide made in vivo or in vitro, for example,
glycosylations, acetylations, phosphorylations, PEGylations and sulphations. In
addition, protein fragments, analogs (including amino acids not encoded by the
genetic code, e.g. homocysteine, ornithine, p-acetylphenylalanine, D-amino acids, and
creatine), natural or artificial mutants or variants or combinations thereof, fusion
proteins, derivatized residues (e.g. alkylation of amine groups, acetylations or
esterifications of carboxyl groups) and the like are included within the meaning of
polypeptide.

"Amino acids" or "amino acid residues" may be referred to herein by either
their commonly known three letter symbols or by the one-letter symbols
recommended by the IUPAC-IUB Biochemical Nomenclature Commission.
Nucleotides, likewise, may be referred to by their commonly accepted single-letter
codes.

"Sequence variants" refers to variants of discrete antibodies (that is, antibodies
whose sequence can be uniquely defined) including polynucleotide and polypeptide
variants. Sequence variants are sequences that are related to one another or to a
common nucleic acid or amino acid "reference sequence" but contain some
differences in nucleotide or amino acid sequence from each other. These changes can
be transitions, transversions, conservative substitutions, non-conservative
substitutions, deletions, insertions or substitutions with non-naturally occurring
nucleotides or amino acids (mimetics). The phrase "optimizing a sequence" refers to
the process of creating nucleic acid or protein variants so that the desired functionality
and/or properties of the protein or nucleic acid are improved. One of skill will realize
that optimizing an antibody could involve selecting a variant with lower functionality
than the parental protein if that is desired.
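
As an illustration of how nucleotide differences between a variant and its reference sequence might be classified, the following sketch labels each point substitution as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion (purine to pyrimidine, or the reverse). The function names are hypothetical, and the sketch assumes two pre-aligned sequences of equal length with no insertions or deletions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_point_substitution(ref, alt):
    """Return 'identical', 'transition' or 'transversion' for two bases."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unsupported base: {base!r}")
    if ref == alt:
        return "identical"
    # A transition keeps the base in the same chemical class.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def list_substitutions(reference, variant):
    """List (1-based position, ref base, variant base, kind) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        (i + 1, r, v, classify_point_substitution(r, v))
        for i, (r, v) in enumerate(zip(reference, variant))
        if r.upper() != v.upper()
    ]
```

For example, `list_substitutions("ACGT", "GCGA")` reports an A to G transition at position 1 and a T to A transversion at position 4.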

The term "antibody" as used herein includes antibodies obtained from both
polyclonal and monoclonal preparations, as well as: hybrid (chimeric) antibody
molecules (see, for example, Winter et al. (1991) Nature 349:293-299; and U.S. Pat.
No. 4,816,567); F(ab')2 and F(ab) fragments; Fv molecules (noncovalent
heterodimers, see, for example, Inbar et al. (1972) Proc Natl Acad Sci USA
69:2659-2662; and Ehrlich et al. (1980) Biochem 19:4091-4096); single-chain Fv molecules
(sFv) (see, for example, Huston et al. (1988) Proc Natl Acad Sci USA 85:5879-5883);
dimeric and trimeric antibody fragment constructs; minibodies (see, e.g., Pack et al.
(1992) Biochem 31:1579-1584; Cumber et al. (1992) J Immunology 149B:120-126);
humanized antibody molecules (see, for example, Riechmann et al. (1988) Nature
332:323-327; Verhoeyen et al. (1988) Science 239:1534-1536; and U.K. Patent
Publication No. GB 2,276,169, published Sep. 21, 1994); and any functional
fragments obtained from such molecules, wherein such fragments retain
specific-binding properties of the parent antibody molecule.

As used herein, the terms "antibody" and "antibodies" further refer to
monoclonal antibodies, multispecific antibodies, human antibodies, humanized
antibodies, camelised antibodies, chimeric antibodies, single-chain Fvs (scFv), single
chain antibodies, single domain antibodies, Fab fragments, F(ab) fragments,
disulfide-linked Fvs (sdFv), anti-idiotypic (anti-Id) antibodies, and epitope-binding fragments
of any of the above. In particular, antibodies include immunoglobulin molecules and
immunologically active fragments of immunoglobulin molecules, i.e., molecules that
contain an antigen binding site. Immunoglobulin molecules can be of any type (e.g.,
IgG, IgE, IgM, IgD, IgA and IgY), class (e.g., IgG1, IgG2, IgG3, IgG4, IgA1 and IgA2)
or subclass.

A typical antibody contains two heavy chains paired with two light chains. A
full-length heavy chain is about 50 kD in size (approximately 446 amino acids in
length), and is encoded by a heavy chain variable region gene (about 116 amino
acids) and a constant region gene. There are different constant region genes encoding
heavy chain constant region of different isotypes such as alpha, gamma (IgG1, IgG2,
IgG3, IgG4), delta, epsilon, and mu sequences. A full-length light chain is about 25
kD in size (approximately 214 amino acids in length), and is encoded by a light chain
variable region gene (about 110 amino acids) and a kappa or lambda constant region
gene. The variable regions of the light and/or heavy chain are responsible for binding
to an antigen, and the constant regions are responsible for the effector functions
typical of an antibody.

As used herein, the term "CDR" refers to the complementarity determining region
within antibody variable sequences. There are three CDRs in each of the variable
regions of the heavy chain and the light chain, which are designated CDR1, CDR2
and CDR3, for each of the variable regions. The exact boundaries of these CDRs
have been defined differently according to different systems. The system described
by Kabat (Kabat et al., Sequences of Proteins of Immunological Interest, National
Institutes of Health, Bethesda, MD (1987) and (1991)) not only provides an
unambiguous residue numbering system applicable to any variable region of an
antibody, but also provides precise residue boundaries defining the three CDRs.
These CDRs may be referred to as Kabat CDRs. Chothia and coworkers (Chothia &
Lesk, J. Mol. Biol. 196:901-917, 1987, and Chothia et al., Nature 342:877-883
(1989)) found that certain sub-portions within Kabat CDRs adopt nearly identical
peptide backbone conformations, despite having great diversity at the level of amino
acid sequence. These sub-portions were designated as L1, L2 and L3 or H1, H2 and
H3, where the "L" and the "H" designate the light chain and the heavy chain regions,
respectively. These regions can be referred to as Chothia CDRs, which have
boundaries that overlap with Kabat CDRs. Other boundaries defining CDRs
overlapping with the Kabat CDRs have been described by Padlan (FASEB J.
9:133-139 (1995)) and MacCallum (J Mol Biol 262(5):732-45 (1996)). Still other CDR
boundary definitions may not strictly follow one of the above systems, but will
nonetheless overlap with the Kabat CDRs, although they may be shortened or
lengthened in light of prediction or experimental findings that particular residues or
groups of residues or even entire CDRs do not significantly impact antigen binding.
The methods used herein can utilize CDRs defined according to any of these systems,
although preferred embodiments use Kabat or Chothia defined CDRs.
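
A minimal sketch of assigning a residue to a Kabat CDR or to the framework is given below. It assumes the sequence has already been numbered under the Kabat scheme by some external tool, and it uses the commonly cited Kabat CDR boundaries (light chain 24-34, 50-56, 89-97; heavy chain 31-35, 50-65, 95-102). Those boundaries and the function name `kabat_region` are assumptions supplied for illustration; they are not recited in this specification.

```python
import re

# Commonly cited Kabat CDR boundaries (inclusive), keyed by chain.
# Illustrative assumption; substitute another scheme (e.g. Chothia) as needed.
KABAT_CDRS = {
    "L": {"CDR1": (24, 34), "CDR2": (50, 56), "CDR3": (89, 97)},
    "H": {"CDR1": (31, 35), "CDR2": (50, 65), "CDR3": (95, 102)},
}


def kabat_region(chain, position):
    """Return 'CDR1', 'CDR2', 'CDR3' or 'framework' for a Kabat position.

    chain:    'H' or 'L'.
    position: Kabat position label such as '52' or '52A'; an insertion
              letter is assigned to the same region as its base number.
    """
    match = re.fullmatch(r"(\d+)([A-Z]?)", position)
    if match is None:
        raise ValueError(f"not a Kabat position label: {position!r}")
    number = int(match.group(1))
    for name, (start, end) in KABAT_CDRS[chain].items():
        if start <= number <= end:
            return name
    return "framework"
```

For example, `kabat_region("H", "52A")` returns `"CDR2"` and `kabat_region("L", "23")` returns `"framework"`. Keeping the boundaries in a single table makes it straightforward to swap in a different CDR definition.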
As used herein, the term "epitopes" refers to fragments of a polypeptide or
protein having antigenic or immunogenic activity in an animal, preferably in a
mammal, and most preferably in a human. An epitope having immunogenic activity
is a fragment of a polypeptide or protein that elicits an antibody response in an
animal. An epitope having antigenic activity is a fragment of a polypeptide or protein
to which an antibody immunospecifically binds as determined by any method
well-known to one of skill in the art, for example by immunoassays. Antigenic epitopes
need not necessarily be immunogenic.

As used herein, the term "fragment" refers to a peptide or polypeptide
(including, but not limited to an antibody) comprising an amino acid sequence of at
least 5 contiguous amino acid residues, at least 10 contiguous amino acid residues, at
least 15 contiguous amino acid residues, at least 20 contiguous amino acid residues, at
least 25 contiguous amino acid residues, at least 40 contiguous amino acid residues, at
least 50 contiguous amino acid residues, at least 60 contiguous amino acid residues, at
least 70 contiguous amino acid residues, at least 80 contiguous amino acid residues, at
least 90 contiguous amino acid residues, at least 100 contiguous amino acid residues,
at least 125 contiguous amino acid residues, at least 150 contiguous amino acid
residues, at least 175 contiguous amino acid residues, at least 200 contiguous amino
acid residues, or at least 250 contiguous amino acid residues of the amino acid
sequence of another polypeptide or protein. In a specific embodiment, a fragment of a
protein or polypeptide retains at least one function of the protein or polypeptide.

As used herein, the term "immunospecifically binds to an antigen" and
analogous terms refer to peptides, polypeptides, proteins (including, but not limited to
fusion proteins and antibodies) or fragments thereof that specifically bind to an
antigen or a fragment and do not specifically bind to other antigens. A peptide,
polypeptide, or protein that immunospecifically binds to an antigen may bind to other
antigens with lower affinity as determined by, e.g., immunoassays, BIAcore, or other
assays known in the art. Antibodies or fragments that immunospecifically bind to an
antigen may be cross-reactive with related antigens. Preferably, antibodies or
fragments that immunospecifically bind to an antigen do not cross-react with other
antigens.

As used herein, the term "in combination" refers to the use of more than one
therapy (e.g., more than one prophylactic agent and/or therapeutic agent). The use
of the term "in combination" does not restrict the order in which therapies (e.g.,
prophylactic and/or therapeutic agents) are administered to a subject. A first therapy
(e.g., a first prophylactic or therapeutic agent) can be administered prior to (e.g., 5
minutes, 15 minutes, 30 minutes, 45 minutes, 1 hour, 2 hours, 4 hours, 6 hours, 12
hours, 24 hours, 48 hours, 72 hours, 96 hours, 1 week, 2 weeks, 3 weeks, 4 weeks, 5
weeks, 6 weeks, 8 weeks, or 12 weeks before), concomitantly with, or subsequent to
(e.g., 5 minutes, 15 minutes, 30 minutes, 45 minutes, 1 hour, 2 hours, 4 hours, 6
hours, 12 hours, 24 hours, 48 hours, 72 hours, 96 hours, 1 week, 2 weeks, 3 weeks, 4
weeks, 5 weeks, 6 weeks, 8 weeks, or 12 weeks after) the administration of a second
therapy (e.g., a second prophylactic or therapeutic agent) to a subject.

As used herein, the term "pharmaceutically acceptable" refers to being approved by a
regulatory agency of the federal or a state government, or listed in the U.S.
Pharmacopeia, European Pharmacopeia, or other generally recognized pharmacopeia
for use in animals, and more particularly, in humans.

As used herein, the terms "prevent," "preventing," and "prevention" refer to
the inhibition of the development or onset of a disorder or the prevention of the
recurrence, onset, or development of one or more symptoms of a disorder in a subject
resulting from the administration of a therapy (e.g., a prophylactic or therapeutic
agent), or the administration of a combination of therapies (e.g., a combination of
prophylactic or therapeutic agents).

As used herein, the terms "prophylactic agent" and "prophylactic agents" refer
to any agent(s) which can be used in the prevention of a disorder or one or more of the
symptoms thereof. In certain embodiments, the term "prophylactic agent" refers to an
antibody of the invention. In certain other embodiments, the term "prophylactic
agent" refers to an agent other than an antibody of the invention. Preferably, a
prophylactic agent is an agent which is known to be useful to, or has been or is
currently being used to, prevent or impede the onset, development, progression
and/or severity of a disorder or one or more symptoms thereof.

As used herein, the term "prophylactically effective amount" refers to the
amount of a therapy (e.g., prophylactic agent) which is sufficient to result in the
prevention of the development, recurrence, or onset of a disorder or one or more
symptoms thereof, or to enhance or improve the prophylactic effect(s) of another
therapy (e.g., a prophylactic agent).

As used herein, the terms "subject" and "patient" are used interchangeably.
As used herein, the terms "subject" and "subjects" refer to an animal, preferably a
mammal including a non-primate (e.g., a cow, pig, horse, cat, dog, rat, and mouse)
and a primate (e.g., a monkey, such as a cynomolgus monkey, a chimpanzee, and a
human), and most preferably a human. In one embodiment, the subject is a
non-human animal such as a bird (e.g., a quail, chicken, or turkey), a farm animal (e.g., a
cow, horse, pig, or sheep), a pet (e.g., a cat, dog, or guinea pig), or laboratory animal
(e.g., an animal model for a disorder). In a preferred embodiment, the subject is a
human (e.g., an infant, child, adult, or senior citizen).

As used herein, the terms "therapeutic agent" and "therapeutic agents" refer to
any agent(s) which can be used in the prevention, treatment, management, or
amelioration of a disorder or one or more symptoms thereof. In certain embodiments,
the term "therapeutic agent" refers to an antibody of the invention. In certain other
embodiments, the term "therapeutic agent" refers to an agent other than an antibody of
the invention. Preferably, a therapeutic agent is an agent which is known to be useful
for, or has been or is currently being used for the prevention, treatment, management,
or amelioration of a disorder or one or more symptoms thereof.

As used herein, the term "therapeutically effective amount" refers to the
amount of a therapy (e.g., an antibody of the invention), which is sufficient to reduce
the severity of a disorder, reduce the duration of a disorder, ameliorate one or more
symptoms of a disorder, prevent the advancement of a disorder, cause regression of a
disorder, or enhance or improve the therapeutic effect(s) of another therapy.

As used herein, the terms "therapies" and "therapy" can refer to any 
protocol(s), method(s), and/or agent(s) that can be used in the prevention, treatment, 
management, and/or amelioration of a disorder or one or more symptoms thereof. In 
certain embodiments, the terms "therapies" and "therapy" refer to anti-viral therapy, 
anti-bacterial therapy, anti-fungal therapy, anti-cancer agent, biological therapy, 
supportive therapy, and/or other therapies useful in treatment, management, 
prevention, or amelioration of a disorder or one or more symptoms thereof known to 
one skilled in the art, for example, a medical professional such as a physician. 

As used herein, the terms "treat," "treatment," and "treating" refer to the 
reduction or amelioration of the progression, severity, and/or duration of a disorder or 
amelioration of one or more symptoms thereof resulting from the administration of 
one or more therapies (including, but not limited to, the administration of one or more 
prophylactic or therapeutic agents). 

The term "sequence alignment" refers to the result when at least two antibody 
sequences are compared for maximum correspondence, as measured using a sequence 
comparison algorithm. Optimal alignment of sequences for comparison can be 
conducted by any technique known or developed in the art, and the invention is not 
intended to be limited in the alignment technique used. Exemplary alignment methods 
include the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 
(1981), the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 
48:443 (1970), the search for similarity method of Pearson & Lipman, Proc. Natl. 
Acad. Sci. USA 85:2444 (1988), computerized implementations of these 
algorithms (e.g., GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics 
Software Package, Genetics Computer Group, 575 Science Dr., Madison, Wis.), and 
inspection. 
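
As an illustration of the homology alignment approach named above, the following is a minimal sketch of a Needleman & Wunsch style global alignment. The function name and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for the example; they are not parameters taken from this specification, and practical alignments of antibody sequences would ordinarily use a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `needleman_wunsch("ACGT", "AGT")` under these assumed scores aligns the shorter sequence with a single gap opposite the unmatched residue.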

The "three dimensional structure" of a protein is also termed the "tertiary 
structure" or the structure of the protein in three dimensional space. Typically the 
three dimensional structure of a protein is determined through X-ray crystallography 
and the coordinates of the atoms of the amino acids determined. The coordinates are 
then converted through an algorithm into a visual representation of the protein in three 
dimensional space. From this model, the local "environment" of each residue can be 
determined and the "solvent accessibility" or exposure of a residue to the extraprotein 
space can be determined. In addition, the "proximity of a residue to a site of 
functionality" or active site and more specifically, the "distance of the α or β carbons 
of the residue to the site of functionality" can be determined. For glycine residues, 
which lack a β carbon, the α carbon can be substituted. Also from the three 
dimensional structure of a protein, the residues that "contact with residues of interest" 
can be determined. These would be residues that are close in three dimensional space 
and would be expected to form bonds or interactions with the residues of interest. And 
because of the electron interactions across bonds, residues that contact residues in 
contact with residues of interest can be investigated for possible mutability. 
Additionally, nuclear magnetic resonance spectroscopy can be used to determine the 
structure. Additionally, molecular modeling can be used to determine the structure, 
and can be based on a homologous structure or ab initio. Energy minimization 
techniques can also be employed. 
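
The distance measure described above can be sketched as follows. This is a minimal illustration that assumes each residue is represented as a dictionary holding a residue name and atom coordinates in ångströms; the data layout, the function names, and the 8.0 Å contact cutoff are hypothetical and are not taken from this specification.

```python
import math

def reference_atom(residue):
    """Return the beta-carbon coordinates, or the alpha carbon for glycine,
    which lacks a beta carbon."""
    atoms = residue["atoms"]
    if residue["name"] == "GLY" or "CB" not in atoms:
        return atoms["CA"]
    return atoms["CB"]

def distance_to_site(residue, site_xyz):
    """Euclidean distance from the residue's reference atom to a site."""
    return math.dist(reference_atom(residue), site_xyz)

def residues_near_site(residues, site_xyz, cutoff=8.0):
    """Residues whose reference atom lies within `cutoff` of the site.
    The default cutoff is an assumed value for illustration only."""
    return [r for r in residues if distance_to_site(r, site_xyz) <= cutoff]
```

The same reference-atom distance can be applied between pairs of residues to list candidate residues in contact with residues of interest.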

Although not dependent on three dimensional space, the "residue chemistry" 
of each amino acid is influenced by its position in a protein. "Residue chemistry" 
refers to characteristics that a residue possesses in the context of a protein or by itself. 
These characteristics include, but are not limited to, polarity, hydrophobicity, net 
charge, molecular weight, propensity to form a particular secondary structure, and 
space filling size. 
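
Two of the characteristics listed above, hydrophobicity and net charge, can be tabulated per residue as in the sketch below. The hydrophobicity values are the Kyte-Doolittle hydropathy scale, which is one commonly used scale and is an assumption here rather than a scale named in this specification. The net charge calculation is deliberately crude: it counts Lys and Arg as +1 and Asp and Glu as -1, ignoring histidine and the chain termini.

```python
# Kyte-Doolittle hydropathy index, keyed by one-letter amino acid code.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charges near neutral pH (illustrative assumption).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

def mean_hydropathy(sequence):
    """Average hydropathy over all residues of a one-letter sequence."""
    return sum(HYDROPATHY[aa] for aa in sequence) / len(sequence)

def net_charge(sequence):
    """Sum of the simplified side-chain charges."""
    return sum(CHARGE.get(aa, 0) for aa in sequence)
```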

As used herein, the term "carrier" refers to a diluent, adjuvant, excipient, or 
vehicle. Carriers can be liquids, such as water and oils, including those of petroleum, 
animal, vegetable or synthetic origin, such as peanut oil, soybean oil, mineral oil, 
sesame oil and the like. The vehicles (e.g., pharmaceutical vehicles) can be saline, 
gum acacia, gelatin, starch paste, talc, keratin, colloidal silica, urea, and the like. In 
addition, auxiliary, stabilizing, thickening, lubricating and coloring agents can be 
used. When administered to a patient, the carriers are preferably sterile. Water can be 
the carrier when the composition is administered intravenously. Saline solutions and 
aqueous dextrose and glycerol solutions can also be employed as liquid vehicles, 
particularly for injectable solutions. Suitable vehicles also include excipients such as 
starch, glucose, lactose, sucrose, gelatin, malt, rice, flour, chalk, silica gel, sodium 
stearate, glycerol monostearate, talc, sodium chloride, dried skim milk, glycerol, 
propylene glycol, water, ethanol and the like. Compositions, if desired, can also 
contain minor amounts of wetting or emulsifying agents, or pH buffering agents. 

As used herein the term "functional domain" means a segment of a protein 
that has one or more of the following properties: (i) a structurally independent section 
of a protein, (ii) a section of a protein that is homologous to a section of another 
protein, (iii) a segment of protein involved in one or more specific functions, (iv) an 
independently evolving unit in a protein, (v) a segment of protein containing a 
particular sequence motif, (vi) a section of the protein containing an active site, a 
binding site or a regulatory site. See, for example, Suhail A Islam, Jingchu Luo and 
Michael J E Sternberg, 1995, "Identification and analysis of domains in proteins," 
Protein Engineering 8, 513-525; Orengo et al., 1997, "CATH- A Hierarchic 
Classification of Protein Domain Structures," Structure 5, 1093-1108; and Pearl et al., 
2000, "Assigning genomic sequences to CATH," Nucleic Acids Research 28, 277-282, 
which are each hereby incorporated by reference in their entirety. 

An expert system 100 is a computer program that represents and reasons with 
the knowledge of some specialist subject (antibodies) with a view to solving problems 
or giving advice (via rank ordering of substitutions with reasoning). 

Knowledge acquisition is the transfer and transformation of potential problem-solving 
expertise (e.g., knowledge of analysing nucleotide or protein structure, 
nucleotide or protein phylogeny) from the knowledge source to a program. 

Knowledge base 108 is the encoded knowledge for an expert system 100. In a 
rule-based expert system 100, a knowledge base 108 typically incorporates definitions 
of attributes and rules along with control information. 

An inference engine 106 is software that provides the reasoning mechanism in 
expert system 100. In a rule-based expert system 100, it typically implements forward 
chaining and backward chaining strategies. 
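
Forward chaining, one of the two strategies mentioned above, can be sketched as follows. This is a minimal illustration and not the inference engine 106 itself: the rule representation (a tuple of premises paired with one conclusion) and the example facts and rule names are hypothetical.

```python
def forward_chain(facts, rules):
    """Repeatedly fire rules whose premises are all known until no new
    conclusion can be derived. Returns the full set of known facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical rules about a candidate substitution position.
EXAMPLE_RULES = [
    (("position_buried", "position_hydrophobic"), "avoid_charged_residue"),
    (("avoid_charged_residue",), "rank_charged_substitutions_low"),
    (("position_in_cdr", "position_solvent_exposed"), "candidate_for_affinity"),
]
```

Starting from the facts `position_buried` and `position_hydrophobic`, the first rule fires and its conclusion then satisfies the second rule, which is the chaining behavior the term refers to. Backward chaining would instead start from a goal and work back to the facts that support it.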

A substitution in an antibody is the replacement of one monomer with a 
different monomer. 

A virtual surrogate screen is a measure of the activity of an antibody in 
dimensions that are mathematically constructed from physical measurements of 
antibody properties in two or more assays. 
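
One simple way to construct such a dimension is sketched below: each assay is standardized across the variants and the standardized values are combined as a weighted sum. The weighted-sum construction, the function names, and the assay names are assumptions made for illustration; the definition above does not prescribe a particular mathematical construction.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one assay's measurements across all variants."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def surrogate_scores(assays, weights):
    """Combine two or more assays into one constructed score per variant.

    assays:  {assay_name: [measurement for each variant]}
    weights: {assay_name: weight}, an illustrative, user-chosen weighting
    """
    names = list(assays)
    standardized = {name: z_scores(assays[name]) for name in names}
    n_variants = len(standardized[names[0]])
    return [
        sum(weights[name] * standardized[name][i] for name in names)
        for i in range(n_variants)
    ]
```

In practice the weights could be fitted against a small number of low-throughput measurements of the property of interest, so that the constructed score tracks that property.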

The terms screen, assay, test and measurement are used interchangeably to 
mean a method of determining one or more properties of an antibody. 

A high throughput screen, assay, test or measurement is used to describe any 
method for determining one or more properties of a plurality of antibodies either 
sequentially or simultaneously. The actual number of antibody variants whose 
properties can be determined by a test that is considered a high throughput screen 
varies from as few as 84 samples per day (Decker et al. (2003) Appl Biochem 
Biotechnol 105: 689-703.) to many millions. For the purposes of this invention we 
define a high throughput screen as an assay that can measure one or more antibody 
properties for 400 antibody variants in 1 week, preferably a test that can measure one or 
more antibody properties for 1,000 antibody variants in 1 week, more preferably a test 
that can measure one or more antibody properties for 10,000 antibody variants in one 
week. 
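
The thresholds in this definition can be expressed as a small classification routine. The sketch below assumes that each threshold is read as "at least this many variants per week"; the function name and the tier labels are hypothetical.

```python
def throughput_tier(variants_per_week):
    """Classify an assay by how many antibody variants it can measure
    in one week, using the 400 / 1,000 / 10,000 thresholds above."""
    if variants_per_week >= 10_000:
        return "more preferred"
    if variants_per_week >= 1_000:
        return "preferred"
    if variants_per_week >= 400:
        return "high throughput"
    return "not high throughput"
```

For example, an assay running 84 samples per day for seven days measures 588 variants in a week and would be classified as high throughput under this reading.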

5.8 SYNTHESIS OF ANTIBODY SEQUENCE VARIANTS 

Antibody variants can be synthesized by methods for constructing or obtaining 
specific nucleic acid or polypeptide sequences described in the art. Antibody variants 
are designed, for example, in step 03 of Fig. 2, as described in Section 5.2, above. 

Oligonucleotides and polynucleotides can be synthesized using a variety of 
chemistries including phosphoramidite chemistry; optionally this synthesis may be 
performed using a commercially available DNA synthesizer. Oligonucleotides and 
polynucleotides may also be purchased from a commercial supplier of synthetic DNA. 

Chemically synthesized oligonucleotides can be incorporated into larger 
polynucleotides to create one or more of the designed sequence variants using 
site-directed mutagenesis. Suitable site-directed techniques include those in which a 
template strand is used to prime the synthesis of a complementary strand lacking a 
modification in the parent strand, such as methylation or incorporation of uracil 
residues; introduction of the resulting hybrid molecules into a suitable host strain 
results in degradation of the template strand and replication of the desired mutated 
strand. See Kunkel (1985) Proc Natl Acad Sci U S A 82: 488-92; QuikChange™ 
kits available from Stratagene, Inc., La Jolla, Calif. PCR methods for introducing 
site-directed changes can also be employed. Site-directed mutagenesis using a single 
stranded DNA template and mutagenic oligos is well known in the art (Ling & 
Robinson 1997, Anal Biochem 254:157). It has also been shown that several 
oligos can be incorporated at the same time using these methods (Zoller 1992, Curr 
Opin Biotechnol 3:348). Single stranded DNA templates are synthesized by 
degrading double stranded DNA (Strandase™ by Novagen). The resulting product 
after Strandase digestion can be heated and then directly used for sequencing. 
Alternatively, the template can be constructed as a phagemid or M13 vector. Other 
techniques of incorporating mutations into DNA are known and can be found in, e.g., 
Deng et al. 1992, Anal Biochem 200:81. 

Multiple chemically synthesized oligonucleotides can together be assembled 
into larger polynucleotides to create one or more of the designed sequence variants. 
Oligonucleotides can be assembled into larger single- or double-stranded 
polynucleotides in vivo or in vitro by a variety of methods including but not limited to 
annealing, restriction enzyme digestion and ligation, particularly using restriction 
enzymes whose cleavage site is distinct from their recognition sites (see for example 
Pierce 1994, Biotechniques 16:708-15; Mandecki & Bolling 1988, Gene 68:101-7), 
ligation (see for example Edge et al 1981, Nature 292:756-62; Jayaraman & Puccini 
1992, Biotechniques 12:392-8), ligation followed by polymerase chain reaction 
amplification (see for example Jayaraman et al 1991, Proc Natl Acad Sci USA 
88:4084-8), overlap extension using thermostable nucleotide polymerases and/or 
ligases (see for example Ye et al 1992, Biochem Biophys Res Commun. 186:143-9; 
Horton et al 1989, Gene 77:61-8; Stemmer et al 1995, Gene 164:49-53), dual 
asymmetric PCR (see for example Sandhu et al 1992, Biotechniques 12:14-6), 
stepwise elongation of sequences (see for example Majumder 1992, Gene 110:89-94), 
the ligase chain reaction (see for example Au et al 1998, Biochem Biophys Res 
Commun. 248:200-3; Chalmers & Curnow 2001, Biotechniques 30:249-52), 
insertional mutagenesis (see for example Ciccarelli et al 1990, Nucleic Acids Res. 
18:1243-8), the exchangeable template reaction (see Khudyakov et al 1993, Nucleic 
Acids Res. 21:2747-54), sequential ligation of one or more oligonucleotides to an 
anchored oligonucleotide (for example a biotinylated oligonucleotide immobilized on 
streptavidin resin), cotransformation into an appropriate host cell such as mammalian, 
yeast or bacterial cells capable of joining polynucleotides (see for example Raymond 
et al 1999, BioTechniques 26:134-141), or any combination of steps involving the 
activity of one or more of a polymerase, a ligase, a restriction enzyme, and a 
recombinase. Oligonucleotides can optionally be designed to improve their assembly 
into larger polynucleotides and subsequent processing, for example by optimizing 
annealing properties and eliminating restriction sites (see for example Hoover & 
Lubkowski 2002, Nucleic Acids Res. 30:e43). 

Synthesis of polynucleotide sequence variants can also be multiplexed. 
Individual variants can subsequently be identified, for example by picking and 
sequencing single clones. Other methods of deconvolution include testing for an 
easily measured phenotype (examples include but are not limited to colorigenic, 
fluorigenic or turbidity-altering reactions that can be visualized on agar plates), then 
grouping clones according to activity and selecting one or more clones from each 
group. Optionally the one or more clones from each group may be sequenced. 

One example of multiplexed variant synthesis is to incorporate one or more 
oligonucleotides containing one or more alternative nucleotide substitutions into one 
or more polynucleotide reference sequences simultaneously. Oligonucleotides 
synthesized from mixtures of nucleotides can be used. The synthesis of 
oligonucleotide libraries is well known in the art. In one alternative, degenerate oligos 
from trinucleotides can be used (Gaytan, et al., 1998, Chem Biol 5:519; Lyttle, et al 
1995, Biotechniques 19:274; Virnekas, et al 1994, Nucl. Acids Res 22:5600; 
Sondek & Shortle 1992, Proc. Natl. Acad. Sci. USA 89:3581). In another alternative, 
degenerate oligos can be synthesized by resin splitting (Lahr, et al 1999, Proc. Natl 
Acad. Sci. USA 96:14860; Chatellier, et al., 1995, Anal. Biochem. 229:282; and 
Haaparanta & Huse 1995, Mol Divers 1:39). Mixtures of individual primers for the 
substitutions to be introduced by site directed mutagenesis can be simultaneously 
employed in a single reaction to produce the desired combinations of mutations. 
Simultaneous mutation of adjacent residues can be accomplished by preparing a 
plurality of oligonucleotides representing the desired combinations. In an alternative 
embodiment, sequences are assembled using PCR to link synthetic oligos (Horton, et 
al 1989, Gene 77:61; Shi, et al 1993, PCR Methods Appl. 3:46; and Cao 1990, 
Technique 2:109). PCR with a mixture of mutagenic oligos can be used to create a 
multiplexed set of sequence variants that can subsequently be deconvoluted. 
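
The set of sequences represented by an oligonucleotide synthesized from mixtures of nucleotides can be enumerated as in the following sketch. It uses the standard IUPAC nucleotide ambiguity codes; the function names are hypothetical, and the enumeration is offered only as a way to check the size and content of a planned multiplexed set before synthesis.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_degenerate(oligo):
    """List every concrete sequence encoded by a degenerate oligo."""
    choices = [IUPAC[base] for base in oligo.upper()]
    return ["".join(bases) for bases in product(*choices)]

def library_size(oligo):
    """Number of distinct sequences, computed without enumerating them."""
    size = 1
    for base in oligo.upper():
        size *= len(IUPAC[base])
    return size
```

For instance, a single NNK codon expands to 32 distinct trinucleotides, and the size grows multiplicatively with each additional degenerate position, which is why deconvolution of the resulting set becomes necessary.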

Cassette mutagenesis can also be used in creating multiple polynucleotide 
sequence variants. Using this technique, a set of sequences can be generated by 
ligating fragments obtained by oligonucleotide synthesis, PCR or combinations 
thereof. Segments for ligation can, for example, be generated by PCR and subsequent 
digestion with type II restriction enzymes. This enables introduction of mutations via 
the PCR primers. Furthermore, type II restriction enzymes generate non-palindromic 
cohesive ends which significantly reduce the likelihood of ligating fragments in the 
wrong order. Techniques for ligating many fragments can be found in Berger, et al., 
Anal Biochem 214:571 (1993). 

Antibody variants can be synthesized as nucleic acid sequence variants 
according to any of the processes described here, followed by expression either in 
vivo or in an in vitro cell-free system. They may also be made directly using 
commercial peptide synthesizers. Antibody variants may additionally be synthesized 
by chemically ligating one or more synthetic peptides to one or more polypeptide 
segments created by expression of a polynucleotide (see for example Pal et al 2003, 
Protein Expr Purif. 29:185-92). 

Antibody variants may optionally include non-natural amino acids, 
incorporated at specific positions in the protein sequence by a variety of methods (see 
for example Hyun Bae et al 2003, J Mol Biol. 328:1071-81; Hohsaka & Sisido 2002, 
Curr Opin Chem Biol. 6:809-15; Li and Roberts 2003, Chem. Biol 10:233-9). 

The particular chemical and/or molecular biological methods used to construct 
the antibody sequence variants are not critical; any method(s) that provide the desired 
sequence variants can be used. 

5.9 REPRESENTATIVE TESTS FOR ANTIBODY FUNCTION 
Section 5.2 described how a set of antibody variants is designed. 
This set of antibodies is then synthesized using, for example, the techniques described 
in Section 5.8. Then the antibodies are tested for relevant biological activity and/or 
antibody properties. Determination of what constitutes a relevant antibody property is 
a case specific exercise. Non-limiting examples of antibody properties that can be 
relevant in some embodiments of the present invention include 
antigenicity, immunogenicity, immunomodulatory activity, expression of the antibody 
in a homologous host, expression of the antibody in a heterologous host, expression of 
the antibody in a plant cell, susceptibility of the antibody to in vitro post-translational 
modifications and susceptibility of the antibody to in vivo post-translational 
modifications. 

Of particular relevance for this invention are antibody properties whose 
measurements are intensive in their use of such resources as time, space, equipment 
and experimental animals. Such characterizations can be rate limiting for 
empirical-based protein engineering approaches such as those methods applying directed 
evolution or screening libraries produced by other methods. A common solution to 
this limitation is to develop a high-throughput screen. See, for example, Olsen et al. 
(2000) Curr Opin Biotechnol 11:331-7. 

High throughput screens typically do not measure the complex combination of 
functions that are desired in the final engineered antibody. High throughput screens 
can be used to measure some properties of the antibody, and the method of this 
invention allows the properties measured in two or more of these high throughput 
screens to be combined and used to create a virtual surrogate screen for the properties 
of interest. High throughput screens that may be used to measure potentially relevant 
antibody properties include but are not limited to: flow cytometry (Daugherty et al. 
(2000) J Immunol Methods 243: 211-27.; Georgiou (2000) Adv Protein Chem 55: 
293-315.; Olsen et al. (2000) Curr Opin Biotechnol 11: 331-7.), solid phase digital 
imaging (Joo et al. (1999) Chem Biol 6: 699-706.; Joern et al. (2001) J Biomol Screen 
6: 219-23.), computational and cellular immunogenicity assays (Tangri et al. (2002) 
Curr Med Chem 9: 2191-9.), fluorescence anisotropy (Turconi et al. (2001) J Biomol 
Screen 6: 275-90.), flow cytometry, scintillation proximity (Jenh et al. (1998) Anal 
Biochem 256: 47-55.; Skorey et al. (2001) Anal Biochem 291: 269-78.) or magnetic 
bead capture (Yeung et al. (2002) Biotechnol Prog 18: 212-20.) for measurement of 
surface density or binding affinity or avidity, cell surface display (Kim et al. (2000) 
Appl Environ Microbiol 66: 788-93.) [Little], fluorescence polarization assays for 
measuring protein phosphorylation or other cellular components (Parker et al. (2000) 
J Biomol Screen 5: 77-88.; Allen et al. (2002) J Biomol Screen 7: 35-44.; 
Kristjansdottir et al. (2003) Anal Biochem 316: 41-9.), assays that link cellular 
survival or growth to protein activity (Luthi et al. (2003) Biochim Biophys Acta 1620: 
167-78.), assays that couple a reaction to a colorimetric or fluorimetric assay 
including two-hybrid or three-hybrid systems (Young et al. (1998) Nat Biotechnol 16: 
946-50.; Baker et al. (2002) Proc Natl Acad Sci U S A 99: 16537-42.), electrospray 
and matrix-assisted laser desorption mass spectrometry (LC-MS and MALDI) for 
detection of small molecules and antibodies (Jankowski et al. (2001) Anal Biochem 
290: 324-9.; Raillard et al. (2001) Chem Biol 8: 891-8.), high performance liquid 
chromatography (HPLC), enzyme-linked immunosorbent assays (Fahey et al. (2001) 
Anal Biochem 290: 272-6.; Mallon et al. (2001) Anal Biochem 294: 48-54.), detection 
of markers for cellular differentiation (Sottile et al. (2001) Anal Biochem 293: 124-8.), 
induction of a reporter gene in vivo or in vitro (Thompson et al. (2000) Toxicol 
Sci 57: 43-53.), small molecule or protein binding competition assays (Warrior et al. 
(1999) J Biomol Screen 4: 129-135.; McMahon et al. (2000) J Biomol Screen 5: 169-76.) 
and time resolved fluorescence (Zhang et al. (2000) Anal Biochem 281: 182-6.). 

Further examples include measurements of cell lines and primary cell cultures for 
cell-surface receptor surface density, measurements of cell surface receptor 
internalization rates, cell surface receptor post-translational modifications including 
phosphorylation, and binding of antigens including but not limited to cellular growth 
factor receptors, receptors or mediators of tumor-driven angiogenesis, B cell surface 
antigens and proteins synthesized by or in response to pathogens, antigens produced 
by the induction of antibody-mediated cell killing, antigens produced by 
antibody-dependent macrophage activity, histamine, and antigens produced by 
induction of or cross-reaction with anti-idiotype antibodies. 

Examples of antibody properties or activities whose measurement may be 
resource, time or cost-limited and that therefore cannot be accurately measured in 
high throughput are tests for the immunogenicity of an antibody, in vivo or 
cell-culture based viral titer measurements, any experiment in which an experimental 
animal or human being is used as a part of the measurement of one or more properties 
of the antibody, the level of expression of the antibody in a host, any experiment in 
which the antibody is produced within a plant, particularly when the plant must be 
transformed with a polynucleotide encoding the antibody and the antibody be 
expressed within the plant, susceptibility of the antibody to be modified inside a living 
cell, susceptibility of the antibody to be modified not inside a living cell, 
measurement of the composition of a complex mixture of compounds whose 
composition has been altered by the action of the antibody (for example metabolomics 
or metabonomics), alteration of the properties of a cell, for example alteration of the 
growth, replication or differentiation patterns of a cell or population of cells, 
therapeutic efficacy of an antibody and modulation of a signaling pathway. 

Antibodies of the present invention or fragments thereof can be assayed in a 
variety of ways well-known to one of skill in the art. In particular, antibodies of the 
invention or fragments thereof can be assayed for the ability to immunospecifically 
bind to an antigen. Such an assay can be performed in solution (e.g., Houghten, 1992, 
Bio/Techniques 13:412-421), on beads (Lam, 1991, Nature 354:82-84), on chips 
(Fodor, 1993, Nature 364:555-556), in bacteria (U.S. Patent No. 5,223,409), in spores 
(U.S. Patent Nos. 5,571,698; 5,403,484; and 5,223,409), in plasmids (Cull et al., 
1992, Proc. Natl. Acad. Sci. USA 89:1865-1869) or in phage (Scott and Smith, 1990, 
Science 249:386-390; Cwirla et al., 1990, Proc. Natl. Acad. Sci. USA 87:6378-6382; 
and Felici, 1991, J. Mol. Biol. 222:301-310) (each of these references is incorporated 
herein in its entirety by reference). 

The antibodies of the invention or fragments thereof can be assayed for 
immunospecific binding to a specific antigen and cross-reactivity with other antigens 
by any method known in the art. Immunoassays that can be used to analyze 
immunospecific binding and cross-reactivity include, but are not limited to, 
competitive and non-competitive assay systems using techniques such as Western 
blots, radioimmunoassays, ELISA (enzyme linked immunosorbent assay), "sandwich" 
immunoassays, immunoprecipitation assays, precipitin reactions, gel diffusion 
precipitin reactions, immunodiffusion assays, agglutination assays, 
complement-fixation assays, immunoradiometric assays, fluorescent immunoassays, protein A 
immunoassays, to name but a few. Such assays are routine and well-known in the art 
(see, e.g., Ausubel et al., eds., 1994, Current Protocols in Molecular Biology, Vol. 1, 
John Wiley & Sons, Inc., New York, which is incorporated by reference herein in its 
entirety). 

Antibodies of the invention or fragments thereof can also be assayed for their 
ability to inhibit the binding of an antigen to its host cell receptor using techniques 
known to those of skill in the art. For example, cells expressing a receptor can be 
contacted with a ligand for that receptor in the presence or absence of an antibody or 
fragment thereof that is an antagonist of the ligand and the ability of the antibody or 
fragment thereof to inhibit the ligand's binding can be measured by, for example, flow 
cytometry or a scintillation assay. The ligand or the antibody or antibody fragment 
can be labeled with a detectable compound such as a radioactive label (e.g., ³²P, ³⁵S, 
and ¹²⁵I) or a fluorescent label (e.g., fluorescein isothiocyanate, rhodamine, 
phycoerythrin, phycocyanin, allophycocyanin, o-phthaldehyde and fluorescamine) to 
enable detection of an interaction between the ligand and its receptor. Alternatively, 
the ability of antibodies or fragments thereof to inhibit a ligand from binding to its 
receptor can be determined in cell-free assays. For example, a ligand can be 
contacted with an antibody or fragment thereof that is an antagonist of the ligand and 
the ability of the antibody or antibody fragment to inhibit the ligand from binding to 
its receptor can be determined. Preferably, the antibody or the antibody fragment that 
is an antagonist of the ligand is immobilized on a solid support and the ligand is 
labeled with a detectable compound. Alternatively, the ligand is immobilized on a 
solid support and the antibody or fragment thereof is labeled with a detectable 
compound. A ligand can be partially or completely purified (e.g., partially or 
completely free of other polypeptides) or part of a cell lysate. Alternatively, a ligand 
can be biotinylated using techniques well known to those of skill in the art (e.g., 
biotinylation kit, Pierce Chemicals; Rockford, IL). 

An antibody or a fragment thereof constructed and/or identified in accordance 
with the present invention can be tested in vitro and/or in vivo for its ability to 
modulate the biological activity of cells. Such ability can be assessed by, e.g., 
detecting the expression of antigens and genes; detecting the proliferation of cells; 
detecting the activation of signaling molecules (e.g., signal transduction factors and 
kinases); detecting the effector function of cells; or detecting the differentiation of 
cells. Techniques known to those of skill in the art can be used for measuring these 
activities. For example, cellular proliferation can be assayed by ³H-thymidine 
incorporation assays and trypan blue cell counts. Antigen expression can be assayed, 
for example, by immunoassays including, but not limited to, competitive and 
non-competitive assay systems using techniques such as western blots, 
immunohistochemistry, radioimmunoassays, ELISA (enzyme linked immunosorbent 
assay), "sandwich" immunoassays, immunoprecipitation assays, precipitin reactions, 
gel diffusion precipitin reactions, immunodiffusion assays, agglutination assays, 
complement-fixation assays, immunoradiometric assays, fluorescent immunoassays, 
protein A immunoassays, and FACS analysis. The activation of signaling molecules 
can be assayed, for example, by kinase assays and electrophoretic shift assays 
(EMSAs). 

The antibodies, fragments thereof, or compositions of the invention are 
preferably tested in vitro and then in vivo for the desired therapeutic or prophylactic 
activity prior to use in humans. For example, assays which can be used to determine 
whether administration of a specific pharmaceutical composition is indicated include 
cell culture assays in which a patient tissue sample is grown in culture and exposed to, 
or otherwise contacted with, a pharmaceutical composition, and the effect of such 
composition upon the tissue sample is observed. The tissue sample can be obtained 
by biopsy from the patient. This test allows the identification of the therapeutically 
most effective therapy (e.g., prophylactic or therapeutic agent) for each individual 
patient. In various specific embodiments, in vitro assays can be carried out with 
representative cells of cell types involved in a particular disorder to determine if a 
pharmaceutical composition of the invention has a desired effect upon such cell types. 
For example, in vitro assays can be carried out with cell lines. 

In yet other forms of antibody assays, the effect of an antibody, a fragment 
thereof, or a composition of the invention on peripheral blood lymphocyte counts can 
be monitored/assessed using standard techniques known to one of skill in the art. 
Peripheral blood lymphocyte counts in a subject can be determined by, e.g., 
obtaining a sample of peripheral blood from said subject, separating the lymphocytes 
from other components of peripheral blood such as plasma using, e.g., 
Ficoll-Hypaque (Pharmacia) gradient centrifugation, and counting the lymphocytes using 
trypan blue. Peripheral blood T-cell counts in a subject can be determined by, e.g., 
separating the lymphocytes from other components of peripheral blood such as 
plasma using, e.g., Ficoll-Hypaque (Pharmacia) gradient centrifugation, 
labeling the T-cells with an antibody directed to a T-cell antigen which is conjugated 
to FITC or phycoerythrin, and measuring the number of T-cells by FACS. 

The antibodies, fragments, or compositions of the invention used to treat, 
manage, prevent, or ameliorate a viral infection or one or more symptoms thereof can 
be tested for their ability to inhibit viral replication or reduce viral load in in vitro 
assays. For example, viral replication can be assayed by a plaque assay such as 
described, e.g., by Johnson et al., 1997, Journal of Infectious Diseases 
176:1215-1224. The antibodies or fragments thereof administered according to the 
methods of the invention can also be assayed for their ability to inhibit or 
downregulate the expression of viral polypeptides. Techniques known to those of 
skill in the art, including, but not limited to, western blot analysis, northern blot 
analysis, and RT-PCR can be used to measure the expression of viral polypeptides. 

Antibodies, fragments, or compositions of the invention can be tested in 
additional in vitro assays that are well-known in the art. Such additional in vitro 
assays known in the art can also be used to test the existence or development of 
resistance of bacteria to a therapy. Such in vitro assays are described in Gales et al., 
2002, Diag. Microbiol. Infect. Dis. 44(3):301-311; Hicks et al., 2002, Clin. Microbiol. 
Infect. 8(11):753-757; and Nicholson et al., 2002, Diagn. Microbiol. Infect. Dis. 
44(1):101-107. 

The antibodies, fragments, or compositions of the invention can be assayed for 
the ability to treat, manage, prevent, or ameliorate a fungal infection or one or more 
symptoms thereof. Any of the standard anti-fungal assays well-known in the art can 
be used to assess such activity. For instance, tests recommended by the National 
Committee for Clinical Laboratory Standards (NCCLS) (See National Committee for 
Clinical Laboratory Standards, 1995, Proposed Standard M27T, Villanova, Pa., all of which 
is incorporated herein by reference in its entirety) and other methods known to those 
skilled in the art (Pfaller et al., 1993, Infectious Dis. Clin. N. Am. 7: 435-444) can be 
used. Such antifungal properties can also be determined from a fungal lysis assay, as 
well as by other methods, including, inter alia, growth inhibition assays, 
fluorescence-based fungal viability assays, flow cytometry analyses, and other 
standard assays known to those skilled in the art. 

Further, any in vitro assays known to those skilled in the art can be used to 
evaluate the prophylactic and/or therapeutic utility of an antibody disclosed herein for 
a particular disorder or one or more symptoms thereof. 

The antibodies, compositions, or combination therapies of the invention can be 
tested in suitable animal model systems prior to use in humans. Such animal model 
systems include, but are not limited to, rats, mice, chicken, cows, monkeys, pigs, 
dogs, rabbits, etc. Any animal system well-known in the art may be used. Several 
aspects of the procedure may vary; such aspects include, but are not limited to, the 
temporal regime of administering the therapies (e.g., prophylactic and/or therapeutic 
agents), whether such therapies are administered separately or as an admixture, and the 
frequency of administration of the therapies. 

Animal models can be used to assess the efficacy of the antibodies, fragments 
thereof, or compositions of the invention for treating, managing, preventing, or 
ameliorating a particular disorder or one or more symptoms thereof. 

The antibodies, fragments thereof, or compositions of the present invention can 
be assayed for their ability to decrease the time course of a particular disorder by at 
least 25%, preferably at least 50%, at least 60%, at least 75%, at least 85%, at least 
95%, or at least 99%. The antibodies, compositions, or combination therapies of the 
invention can also be assayed for their ability to increase the survival period of 
organisms (e.g., humans) suffering from a particular disorder by at least 25%, 
preferably at least 50%, at least 60%, at least 75%, at least 85%, at least 95%, or at 
least 99%. Further, antibodies, fragments thereof, compositions, or combination 
therapies of the invention can be assayed for their ability to reduce the hospitalization 
period of humans suffering from viral respiratory infection by at least 60%, preferably at 
least 75%, at least 85%, at least 95%, or at least 99%. 

The toxicity and/or efficacy of the antibodies, fragments thereof, or 
compositions of the present invention can be assayed by standard pharmaceutical 
procedures in cell cultures or experimental animals, e.g., for determining the LD50 
(the dose lethal to 50% of the population) and the ED50 (the dose therapeutically 
effective in 50% of the population). The dose ratio between toxic and therapeutic 
effects is the therapeutic index and it can be expressed as the ratio LD50/ED50. 
Antibodies that exhibit large therapeutic indices are preferred. While antibodies that 
exhibit toxic side effects can be used, care should be taken to design a delivery system 
that targets such agents to the site of affected tissue in order to minimize potential 
damage to uninfected cells and, thereby, reduce side effects. 
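
The therapeutic index described above is a simple ratio. The following is a minimal sketch in Python, assuming both doses are expressed in the same units; the function name and the numeric doses are hypothetical and are not taken from this disclosure.

```python
def therapeutic_index(ld50: float, ed50: float) -> float:
    """Return the ratio LD50/ED50 for two doses given in the same units."""
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("LD50 and ED50 must be positive")
    return ld50 / ed50


# Hypothetical doses (mg/kg), chosen only to illustrate the arithmetic.
index = therapeutic_index(ld50=200.0, ed50=10.0)
print(index)
```

A larger value indicates a wider margin between the effective dose and the lethal dose, which is why large indices are preferred.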

Technological advances in the future may make it possible to measure in 
higher throughput properties that can currently be measured only in low throughput. 
One skilled in the art will readily see that the methods of this invention may be used 
to correlate any antibody properties that are not easily measured with a high- 
throughput assay with other properties that are readily measured in high throughput. 

5.10 KITS 

The invention provides kits comprising a set of variants or a single variant in a 
set of variants that have been refined by the apparatus and methods described herein. 

The invention also provides a pharmaceutical pack or kit comprising one or 
more containers filled with a variant or set of variants of the present invention. The 
pharmaceutical pack or kit may further comprise one or more other prophylactic or 
therapeutic agents useful for the treatment of a particular disease. The invention also 
provides a pharmaceutical pack or kit comprising one or more containers filled with 
one or more of the ingredients of the pharmaceutical compositions of the invention. 
Optionally associated with such container(s) can be a notice in the form prescribed by 
a governmental agency regulating the manufacture, use or sale of pharmaceuticals or 
biological products, which notice reflects approval by the agency of manufacture, use 
or sale for human administration. 

5.11 ARTICLES OF MANUFACTURE 

The present invention also encompasses a finished packaged and labeled 
pharmaceutical product. This article of manufacture includes the appropriate unit 
dosage form in an appropriate vessel or container such as a glass vial or other 
container that is hermetically sealed. In the case of dosage forms suitable for 
parenteral administration the active ingredient is sterile and suitable for administration 
as a particulate free solution. In other words, the invention encompasses both 
parenteral solutions and lyophilized powders, each being sterile, and the latter being 
suitable for reconstitution prior to injection. Alternatively, the unit dosage form may 
be a solid suitable for oral, transdermal, topical or mucosal delivery. 

In a preferred embodiment, the unit dosage form is suitable for intravenous, 
intramuscular or subcutaneous delivery. Thus, the invention encompasses solutions, 
preferably sterile, suitable for each delivery route. 

As with any pharmaceutical product, the packaging material and container are 
designed to protect the stability of the product during storage and shipment. Further, 
the products of the invention include instructions for use or other informational 
material that advise the physician, technician or patient on how to appropriately 
prevent or treat the disease or disorder in question. In other words, the article of 
manufacture includes instruction means indicating or suggesting a dosing regimen 
including, but not limited to, actual doses, monitoring procedures (such as methods 
for monitoring mean absolute lymphocyte counts, tumor cell counts, and tumor size) 
and other monitoring information. 

More specifically, the invention provides an article of manufacture comprising 
packaging material, such as a box, bottle, tube, vial, container, sprayer, insufflator, 
intravenous (i.v.) bag, envelope and the like; and at least one unit dosage form of a 
pharmaceutical agent contained within said packaging material. The invention further 
provides an article of manufacture comprising packaging material, such as a box, 
bottle, tube, vial, container, sprayer, insufflator, intravenous (i.v.) bag, envelope and 
the like; and at least one unit dosage form of each pharmaceutical agent contained 
within said packaging material. 

In a specific embodiment, an article of manufacture comprises packaging 
material and a pharmaceutical agent and instructions contained within said packaging 
material, wherein said pharmaceutical agent is a humanized antibody and a 
pharmaceutically acceptable carrier, and said instructions indicate a dosing regimen 
for preventing, treating or managing a subject with a particular disease. In another 
embodiment, an article of manufacture comprises packaging material and a 
pharmaceutical agent and instructions contained within said packaging material, 
wherein said pharmaceutical agent is a humanized antibody, a prophylactic or 
therapeutic agent other than the humanized antibody and a pharmaceutically 
acceptable carrier, and said instructions indicate a dosing regimen for preventing, 
treating or managing a subject with a particular disease. In another embodiment, an 
article of manufacture comprises packaging material and two pharmaceutical agents 
and instructions contained within said packaging material, wherein said first 
pharmaceutical agent is a humanized antibody and a pharmaceutically acceptable 
carrier and said second pharmaceutical agent is a prophylactic or therapeutic agent 
other than the humanized antibody, and said instructions indicate a dosing regimen for 
preventing, treating or managing a subject with a particular disease. 

The present invention provides that the adverse effects that may be reduced or 
avoided by the methods of the invention are indicated in informational material 
enclosed in an article of manufacture for use in preventing, treating or ameliorating 
one or more symptoms associated with a disease. Adverse effects that may be 
reduced or avoided by the methods of the invention include but are not limited to vital 
sign abnormalities (e.g., fever, tachycardia, bradycardia, hypertension, hypotension), 
hematological events (e.g., anemia, lymphopenia, leukopenia, thrombocytopenia), 
headache, chills, dizziness, nausea, asthenia, back pain, chest pain (e.g., chest 
pressure), diarrhea, myalgia, pain, pruritus, psoriasis, rhinitis, sweating, injection site 
reaction, and vasodilatation. Since some of the therapies may be immunosuppressive, 
prolonged immunosuppression may increase the risk of infection, including 
opportunistic infections. Prolonged and sustained immunosuppression may also result 
in an increased risk of developing certain types of cancer. 

Further, the information material enclosed in an article of manufacture can 
indicate that foreign proteins may also result in allergic reactions, including 
anaphylaxis, or cytokine release syndrome. The information material should indicate 
that allergic reactions may exhibit only as mild pruritic rashes or they may be severe 
such as erythroderma, Stevens-Johnson syndrome, vasculitis, or anaphylaxis. The 
information material should also indicate that anaphylactic reactions (anaphylaxis) are 
serious and occasionally fatal hypersensitivity reactions. Allergic reactions including 
anaphylaxis may occur when any foreign protein is injected into the body. They may 
range from mild manifestations such as urticaria or rash to lethal systemic reactions. 
Anaphylactic reactions occur soon after exposure, usually within 10 minutes. Patients 
may experience paresthesia, hypotension, laryngeal edema, mental status changes, 
facial or pharyngeal angioedema, airway obstruction, bronchospasm, urticaria and 
pruritus, serum sickness, arthritis, allergic nephritis, glomerulonephritis, temporal 
arthritis, or eosinophilia. 

6. EXAMPLES 

6.1 ENGINEERING A PROTEIN (PROTEINASE K) USING EXPERT 
SUBSTITUTION SELECTION METHODS AND SEQUENCE-ACTIVITY 
RELATIONSHIPS 

The design, synthesis and analysis of sequence variants of proteinase K is 
described here as an example of the use of sequence-activity relationships to engineer 
desired properties into a protein. Also described is the analysis of these variants using 
different functional tests, and methods for determining components of a virtual 
screen. 

Fig. 6 shows the amino acid sequence of proteinase K that occurs naturally in 
the fungus Tritirachium album Limber (Gunkel et al. (1989) Eur J Biochem 179: 185- 
194) (SEQ ID NO.: 2) together with an E. coli leader peptide (SEQ ID NO.: 1). Fig. 7 
shows a nucleotide sequence designed to encode proteinase K (SEQ ID NO.: 3). The 
sequence has been modified from the original Tritirachium album sequence by 
removing an intron, adding an E. coli leader peptide and altering the codons used to 
resemble the distribution found in the highly expressed genes of E. coli. The gene was 
synthesized for the natural proteinase K from oligonucleotides. 

Several different criteria were used to identify positions and substitutions to 
make in the proteinase K sequence, as detailed below. 

6.1.1 Principal Component Analysis to Identify Substitutions that may 
Contribute to Thermostability 

The proteinase K gene was used as a probe against GenBank using BLAST-based 
algorithms. A BLAST score was chosen as a cut-off that identified more than 
ten but fewer than one hundred related sequences. This search identified the 49 
sequences listed in Fig. 8. 

The sequences (49 rows x 728 variables) were represented in a Free-Wilson 
method of qualitative binary description of monomers (Kubinyi, 3D QSAR in drug 
design theory methods and applications. Pergamon Press, Oxford, 1990, pp 589-638), 
and distributed in a maximally compressed space using principal component analysis 
so that the first principal component (PC) captured 10.8 percent of all variance 
information (eigenvalue of 79), the second principal component (PC) captured 7.8 
percent of all variance information (eigenvalue of 57), the third principal component 
(PC) captured 6.9 percent of all variance information (eigenvalue of 50), the fourth 
principal component (PC) captured 6.2 percent of all variance information 
(eigenvalue of 45), the fifth principal component (PC) captured 5.4 percent of all 
variance information (eigenvalue of 39) and so on until the 728th principal component 
(PC) captured 0 percent of all variance information (eigenvalue of 0). 
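
The binary description referred to above can be illustrated with a short sketch. It assumes that each variable is one (alignment position, residue) pair and that a sequence scores 1 for a variable when it carries that residue at that position; the function name and the three-residue toy alignment are hypothetical and are not the 49 sequences of Fig. 8.

```python
def free_wilson_encode(aligned_seqs):
    """Build one 0/1 column per (position, residue) pair seen in the alignment."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")
    # Sorted so that the column order is reproducible.
    features = sorted({(pos, res)
                       for seq in aligned_seqs
                       for pos, res in enumerate(seq, start=1)})
    matrix = [[1 if seq[pos - 1] == res else 0 for pos, res in features]
              for seq in aligned_seqs]
    return features, matrix


# Toy alignment of three sequences, for illustration only.
features, matrix = free_wilson_encode(["ADK", "ADR", "GDK"])
for row in matrix:
    print(row)
```

Each row of the resulting matrix corresponds to one sequence and each column to one variable, which is the form of input a principal component analysis expects. How gaps and invariant positions were handled in the original analysis is not stated, so the sketch simply keeps every observed pair.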

All sequences were plotted in the first six principal components, which 
captured a total of 42 percent of all variance information present in the 728 
dimensions. Sequences 46, 47, 48, 49 are all derived from thermophilic organisms 
and are all well separated from the proteinase K homologs 1-45 in both of the first 
two principal components, as shown in Fig. 9. 
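
The reported percentages can be checked against the reported eigenvalues. The sketch below assumes that each of the 728 variables was scaled to unit variance, so that the total variance equals 728; the text does not state this scaling, but the figures agree with it to within the rounding of the eigenvalues.

```python
N_VARIABLES = 728

# (eigenvalue, reported percent of variance) for the first five PCs in the text.
reported = [(79, 10.8), (57, 7.8), (50, 6.9), (45, 6.2), (39, 5.4)]

for rank, (eigenvalue, percent) in enumerate(reported, start=1):
    computed = 100 * eigenvalue / N_VARIABLES
    print(f"PC{rank}: computed {computed:.2f}%, reported {percent}%")

# Cumulative share of the first five components under the same assumption.
first_five = sum(100 * ev / N_VARIABLES for ev, _ in reported)
print(f"first five PCs: {first_five:.1f}%")
```

Under this assumption the first five components account for roughly 37 percent of the variance, which leaves about 5 percent for the sixth component if the first six capture 42 percent.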

A corresponding plot of all loads describes the influence of each variable on 
the sample distribution in the various PCs. The correlation between loads (influence 
of variables - in this case amino acid residues) and score (distribution of samples - 
here proteinase K homologs) illustrates graphically which residues are unique in 
determining the phylogenetic separation of genes 46-49 from genes 1-45. This is 
shown in Fig. 10. 

Subsequently, the lower left corner of the bottom left quadrant of the loads 
plot was magnified and the variables labeled (Fig. 11). By adding the PC1 and PC2 
value for each variable one can rank order the influence of each residue for their 
reciprocal effect on sample distribution. This distribution of residue effects can be 
due to common ancestral history or can be due to functional constraints among this 
group of samples. 
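
The ranking step described above can be sketched as follows. The load values are hypothetical (the real values would be read from the loads plot), and the sketch assumes that the variables of interest lie in the lower left quadrant, so that the most negative PC1 + PC2 sum ranks first. Labels "X1" and "X2" are placeholders.

```python
# Hypothetical (PC1, PC2) loads; not values from Fig. 10 or Fig. 11.
loads = {
    "15D": (-0.09, -0.08),
    "23P": (-0.09, -0.08),
    "X1": (-0.02, -0.05),
    "X2": (0.03, -0.01),
}


def rank_by_summed_load(loads):
    """Order variables by PC1 + PC2, most negative first; ties broken by name."""
    return sorted(loads, key=lambda name: (sum(loads[name]), name))


ranking = rank_by_summed_load(loads)
print(ranking)
```

Variables with identical loads, such as "15D" and "23P" in this toy input, receive the same rank score and are separated only by the alphabetical tie-break.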

As can be seen in Fig. 11, residues that are completely co-evolving (due to 
sampling effects, phylogenetic ancestry or other) will have the exact same load and 
consequently collapse the variable space in as many dimensions as there are absolute 
coevolving residues. This is illustrated in the graph where residues 15D, 18D, 19Q, 
22L, 23P, 65Y, 66D, 110R, 137P, 164D, 189C, 198R all are completely co-evolving 
and all have profound effect on the distribution of samples 46-49 in PC1 and PC2. 
After removing residues that are unique for only one of the extreme samples, residues 
that are common to the thermophiles but unique to one individual were retained and 
further explored. Variables here can be amino acids as depicted in this example, or 
any type of feature. Features include, but are not limited to, physico-chemical 
properties of one or more amino acid residues. The residues can be a block or 
modulated within the gene, or it can be a combination of residues not genetically 
linked such as in the example above of residues 15D, 18D, 19Q, 22L, 23P, 65Y, 66D, 
110R, 137P, 164D, 189C, 198R. 
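
Variables that co-vary perfectly across all samples have identical columns in the binary description, and identical columns necessarily receive identical loads. A minimal sketch of detecting such groups directly from the binary matrix is given below; the toy matrix is hypothetical, and the label "other" is a placeholder for an unrelated variable.

```python
from collections import defaultdict


def group_identical_columns(features, matrix):
    """Return groups of feature names whose 0/1 columns match in every sample."""
    groups = defaultdict(list)
    for j, name in enumerate(features):
        column = tuple(row[j] for row in matrix)
        groups[column].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy data: rows are samples, columns are variables (illustrative values only).
features = ["15D", "18D", "19Q", "other"]
matrix = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
co_varying = group_identical_columns(features, matrix)
print(co_varying)
```

Collapsing each such group to a single representative variable before the analysis removes the redundant dimensions without changing the sample distribution.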

The loads for the amino acids most responsible for the clustering of 
thermophilic proteinase K homologs are shown in Fig. 12. This information was then 
incorporated into knowledge base 108. This is an example of pre-processing 
information. 

6.1.2 Structural Information for Homologous Enzymes 

The BLAST search of GenBank for proteinase K homologs also revealed that 
proteinase K is homologous to subtilisin and other serine proteases. Subtilisin in 
particular has been extensively studied. The structures of naturally occurring and 
variant subtilisins have been obtained, and there is a large body of data regarding the 
functional effects of a substantial number of mutations. See, for example, Bryan, 
2000, Biochim Biophys Acta 1543:203-222. Sequence and structural alignments of 
proteinase K with subtilisin allowed for the identification of homologous positions in 
proteinase K having changes known to improve activity or thermostabilize subtilisin. 
This information was incorporated into the knowledge base 108. This is an example 
of pre-processing information. 

6.1.3 Sequence Information from Thermostable Close Homologs 

Amongst the closest ten homologs of proteinase K identified by BLAST 
search of GenBank are enzymes known to be thermostable. These enzymes were 
aligned, and positions that were conserved between the thermostable homologs but not 
found in non-thermostable homologs were identified. This information was then 
incorporated into the knowledge base 108. This is an example of pre-processing 
information. 

6.1.4 Sequence Information from Close Homologs 

One of the homologs identified in the BLAST search was highly related to 
proteinase K (>95% sequence identity) and also thermostable. The sequence of this 
protein was aligned with proteinase K and all amino acid changes between the two 
enzymes were identified. This information was then incorporated into the knowledge 
base 108. This is an example of pre-processing information. 

6.1.5 Information Processing 

Using the information described above that was placed in knowledge base 
108, the following rules 120 were defined. 

(a) Changes that are already present in proteinase K were eliminated. 

(b) Changes that occur in the pro-region of the protein were eliminated. 

(c) A score proportional to the load from the PCA analysis was added. 

(d) A score for conservative changes was added. 

(e) A score for changes found in a close homolog (>95% identical) was added. 

(f) A score for changes found in a close homolog that is thermostable but not in 
close homologs that are not thermostable was added. 

An initial sequence space of 24 residues was defined using rules (a) through 
(f). Changes with the top 24 scores were picked. These residues are shown in Fig. 
13. 
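
Rules (a) through (f) amount to two filters followed by an additive score. The sketch below shows one way this could be coded. The weights, the use of the absolute value of the load, the field names and the toy candidates are all assumptions made for illustration; the disclosure does not give the actual scoring values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    already_in_wild_type: bool   # rule (a): eliminate
    in_pro_region: bool          # rule (b): eliminate
    pca_load: float              # rule (c): score proportional to load
    conservative: bool           # rule (d)
    in_close_homolog: bool       # rule (e)
    thermostable_only: bool      # rule (f)


# Hypothetical weights; not taken from the source.
W_PCA, W_CONSERVATIVE, W_CLOSE, W_THERMO = 10.0, 1.0, 2.0, 3.0


def score(c):
    """Return the additive score, or None if rule (a) or (b) eliminates c."""
    if c.already_in_wild_type or c.in_pro_region:
        return None
    return (W_PCA * abs(c.pca_load) + W_CONSERVATIVE * c.conservative
            + W_CLOSE * c.in_close_homolog + W_THERMO * c.thermostable_only)


def top_changes(candidates, n):
    """Names of the n highest-scoring surviving candidates."""
    kept = [(score(c), c.name) for c in candidates if score(c) is not None]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in kept[:n]]


# Toy candidates with placeholder names.
candidates = [
    Candidate("A", False, False, 0.2, True, False, True),
    Candidate("B", True, False, 0.9, True, True, True),
    Candidate("C", False, False, 0.0, False, True, False),
    Candidate("D", False, True, 0.9, True, True, True),
    Candidate("E", False, False, 0.1, True, True, False),
]
print(top_changes(candidates, 2))
```

In the example of this section the equivalent of `n` would be 24, giving the changes listed in Fig. 13.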

These variations and all combinations of these variations encompass a 
sequence space of over a million different sequences. To reduce the number of 
variants to test in the first set of variants, a design based on prior knowledge and 
single-site statistics considerations was used (Fig. 2, step 03). 
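
The size of this sequence space follows from simple counting. The sketch below assumes that each of the 24 changes is either present or absent independently of the others, so that the number of distinct combinations is 2 raised to the power 24.

```python
N_CHANGES = 24

# Each change is either present or absent, giving 2**24 combinations
# (this count includes the unmodified parent sequence).
all_combinations = 2 ** N_CHANGES
print(f"{all_combinations:,}")
```

This is about 16.8 million sequences, consistent with the statement that the space exceeds a million.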

Based on information about the plasticity of serine proteases and subtilisin 
genes, variants with six changes per clone were designed. In this example all of the 
24 top-scoring changes were equally represented. In other embodiments, a set of 
variants that represent each change with a frequency reflecting its actual score could 
have been designed. In this case, 24 clones were designed that cover the sequence 
space uniformly. One way to measure the uniformity of the space covered is by 
counting the number of instances a particular substitution (e.g., N95C) is seen in the 
24 clones. This number was set at six for all the variations identified. This means 
that in the set of variations synthesized, each of the identified mutations occurs six 
times. For example, the mutation N95C is found in six of the variants, the mutation 
P97S is found in six of the variants, and so forth. 
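
The counting constraint described above (24 clones, six changes per clone, each change seen six times) is self-consistent, since 24 x 6 clone slots equal 24 x 6 occurrences. The sketch below builds one layout that satisfies it using a simple cyclic construction. This construction is an illustrative assumption and is not the design of Fig. 14; in particular it always pairs neighbouring changes, which a real design would normally avoid.

```python
from collections import Counter


def cyclic_design(n_changes=24, per_clone=6):
    """Clone i carries changes i, i+1, ..., i+per_clone-1 (indices mod n_changes)."""
    return [[(i + k) % n_changes for k in range(per_clone)]
            for i in range(n_changes)]


design = cyclic_design()

# Uniformity check: count how many clones carry each change.
counts = Counter(change for clone in design for change in clone)
print(sorted(set(counts.values())))
```

The same `Counter` check can be applied to any proposed set of variants to confirm that every substitution is represented equally often.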

The variants defined by this process are listed in Fig. 14, where Fig. 13 serves 
as the key for Fig. 14. For example, "95" in Fig. 14 means "N95C", and "355" in Fig. 14 
means "P355S". 

The polynucleotides encoding each proteinase K variant defined in Fig. 14 
were synthesized by PCR-based assembly of synthetic oligonucleotides. The 
sequence of each variant was confirmed using an ABI sequencer. The ability of each 
of these variant proteins to hydrolyze casein was then measured simply to determine 
whether the proteinase K variants had any protease activity. This is the first step in 
exploring the sequence space (Fig. 2, step 04). 

These data, as well as data measuring the activity of proteinase K towards the 
hydrolysis of polylactide, can be analyzed using sequence-activity 
correlating methods to evaluate the substitutions (steps 05 and 06 of Fig. 2). In turn, 
this information can be used to update knowledge base 108 and to perform additional 
iterations of the method to thus further explore the sequence space for improvements 
in desired properties. 

Preliminary data indicated that changes at residues 95, 97, 138, 208, 236, 237, 
265 and 299 were found only in poorly performing variants. Changes at residues 123, 
145, 167, 273, 293, 310, 332, 337 and 355 were found in medium performing 
variants. Changes at residues 107, 151, 180, 194, 199 and 267 were found in high 
performing variants. Using this information the next round of sequence sets was 
designed and is shown in Fig. 15. 

Additionally, from the results of the experiments, expert system 100, in 
conjunction with the sequence-activity correlating methods, inferred that the proline to 
serine change (seen at positions 97 and 265) for flexibility and structural perturbation 
twice resulted in disadvantageous changes. This information was coded into the 
knowledge base 108 for future experiments. This is one illustration of updating 
knowledge base 108. 

The sequence of each constructed variant is shown in Fig. 16. The activity of 
the variants towards casein, which is a large polymeric substrate like polylactide, was 
measured. Variant activity towards a modified tetrapeptide, N-succinyl-Ala-Ala-Pro- 
Leu-p-nitroanilide (AAPL-p-NA), which undergoes a colorimetric change upon 
protease-mediated hydrolysis (Sroga et al. (2002) Biotechnol Bioeng 78: 761-9), was 
also measured. Using this substrate, the activity of the variants at three different pH 
values (7, 5.5 and 4.5) was measured. The activity of variants following a five minute 
heat treatment at 65°C was also measured. The activities observed for each property 
measured are shown in Fig. 17. 

For each of the proteinase K activities tested, a partial least squares regression 
(PLSR) was used to model the relationship between amino acid substitution and 
proteinase activity (the sequence-activity relationship) for variants 10-49. The 
application of these methods to nucleic acids, peptides and proteins has been 
described previously. See, for example, Geladi et al., 1986, Analytica Chimica Acta 
186: 1-17; Hellberg et al., 1987, J Med Chem 30: 1126-35; Eriksson et al., 1990, 
Acta Chem Scand 44: 50-55; Jonsson et al., 1993, Nucleic Acids Res. 21: 733-739; 
Norinder et al., 1997, J Pept Res 49: 155-62; Bucht et al., 1999, Biochim Biophys 
Acta 1431: 471-82. 

The PLSR-based sequence-activity model was used to assign a regression 
coefficient to each varied amino acid. The predicted activity for a proteinase K 
variant was calculated by summing the regression coefficients for amino acid 
substitutions that are present in that variant. In this case, terms to account for 
interactions between the varied amino acids were not included, although this can also 
be done. See, for example, Aita et al., 2002, Biopolymers 64: 95-105. Fig. 18 shows 
a correlation between the predictions of the sequence-activity model and the measured 
ability of heat-treated proteinase K variants to hydrolyze AAPL-p-NA. 
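
The additive prediction described above, without interaction terms, can be sketched in a few lines. The regression coefficients below are hypothetical placeholders; the fitted values are those plotted in Fig. 19 and are not reproduced here. The optional `baseline` term is likewise an assumption, standing in for whatever intercept the fitted model uses.

```python
# Hypothetical regression coefficients, for illustration only.
coefficients = {
    "K208H": 0.25,
    "V267I": 0.5,
    "G293A": 0.125,
    "K332R": 0.25,
}


def predict_activity(substitutions, coefficients, baseline=0.0):
    """Sum the coefficients of the substitutions present in a variant.

    Raises KeyError for a substitution that the model has no coefficient for.
    """
    return baseline + sum(coefficients[s] for s in substitutions)


triple = predict_activity(["V267I", "G293A", "K332R"], coefficients)
quadruple = predict_activity(["K208H", "V267I", "G293A", "K332R"], coefficients)
print(triple, quadruple)
```

Because the model is additive, a variant combining several positively weighted substitutions is always predicted to outperform variants carrying only a subset of them, which is the hypothesis that can then be tested experimentally.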

The sequence-activity model was tested for its ability to predict 
the activity of variants that have not been measured, or to identify amino acid 
substitutions that contribute positively to a specific protein property and that can then 
be experimentally combined. To test the sequence-activity model for heat-tolerant 
hydrolyzers of AAPL-p-NA, the regression coefficients from the model were tested, 
as shown in Fig. 19. 

Four of the amino acid changes that had been incorporated into the variants were 
predicted to have a positive effect on the activity of proteinase K after heating. These 
were K208H, V267I, G293A and K332R. Among the variants synthesized in the 
initial set of 48, one (NS40) contained three of these changes (V267I, G293A and 
K332R) and one (NS19) contained the other (K208H). To test the predictive power 
of our model, a variant (NS56) containing all four of these changes was synthesized 
and its activity was compared with that of NS19 and NS40. 

As shown in Fig. 20, combining the four changes identified by the PLSR 
model produced a variant with greater post-heat treatment activity towards AAPL-p- 
NA than the single or triple changes. By synthesizing and measuring the activities of 
only 48 variants, a new variant that was further improved for measured activity was 
designed. This demonstrates that the combination of low-throughput screening and 
mathematical analysis is useful for protein engineering. 

The current paradigm for empirical protein engineering is to employ high 
throughput screens to test libraries of thousands of variants. See, for example, Lin et 
al., 2002, Angew Chem Int Ed Engl 41: 4402-25. In general, high throughput screens 
do not measure all of the properties that are important for the final application. One 
common way of overcoming this discrepancy is the use of tiered screens, in which 
high throughput screens that measure only one or two of the properties of interest are 
followed by lower throughput screens that more accurately reflect the desired protein 
characteristics. See, for example, Ness et al., 2000, Adv Protein Chem 55: 261-292. 
This technique relies on the assumption that the high throughput primary screen will 
identify the amino acid substitutions that are important for the final function but will 
also select some false positives. False positives do not actually contribute to the final 
function and are eliminated by subsequent screens. The alternative possibility, that 
amino acids that would be beneficial for the final application may be missed by the 
initial high throughput screen (false negatives), is seldom considered. By prematurely 
discarding substitutions that would be beneficial for the desired function, the protein 
engineering process may be unnecessarily prolonged or even fail. 

Having measured several properties of the proteinase K variants described 
above and validated the predictive power of the sequence-activity modeling of the 
present invention, the validity of the high throughput screening approach was 
explored in more depth. Although no high throughput screening was explored in this 
example, all of the assays described above could easily be adapted for use as high 
throughput primary screens. Hydrolysis of casein incorporated into media plates has 
been used as a primary screen for protease libraries. See, for example, Ness et al., 
1999, Nat Biotechnol 17: 893-896; Ness et al., 2002, Nat Biotechnol 20: 1251-5. 
Hydrolysis of AAPL-p-NA has also been described. See, for example, Sroga et al., 
2002, Biotechnol Bioeng 78: 761-9. Testing AAPL-p-NA hydrolysis at lowered pH 
(5.5 or 4.5) might be considered an appropriate surrogate for the low pH tolerance that 
will be required by an enzyme that is producing lactic acid from polylactide. 
Similarly, testing AAPL-p-NA hydrolysis following heat treatment may measure the 
stability that will be required for an enzyme that must resist the thermal stresses of 
incorporation into a plastic. Thermostability was expressed in three ways: (i) as the 
absolute level of activity remaining following heat treatment, (ii) as the activity 
remaining relative to the activity prior to heat treatment, and (iii) as the product of 
these two values. Having obtained values for each of these proteinase properties, the 
correlation between the properties was examined, and the amino acid substitutions that 
would be selected by each screen were compared. 
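
The three expressions of thermostability listed above can be computed from two measurements per variant. The following sketch assumes that activity is measured in the same units before and after heat treatment; the function name and the numeric values are hypothetical.

```python
def thermostability_metrics(activity_before, activity_after):
    """Return (i) absolute residual activity, (ii) residual activity relative
    to the activity before heating, and (iii) the product of (i) and (ii)."""
    if activity_before <= 0:
        raise ValueError("activity before heat treatment must be positive")
    absolute = activity_after
    relative = activity_after / activity_before
    return absolute, relative, absolute * relative


# Illustrative readings in arbitrary units.
absolute, relative, product = thermostability_metrics(8.0, 2.0)
print(absolute, relative, product)
```

The product rewards variants that are both highly active and well retained after heating, whereas the relative measure alone can favour a variant that retains most of a very low starting activity.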

Three representative activities were selected for further analysis: (i) activity 
towards AAPL-p-NA at pH 7.0, (ii) absolute activity towards AAPL-p-NA following 
five minutes at 65°C, and (iii) activity towards casein. For each of these activities 
PLSR models similar to that shown in Fig. 18 were constructed, and the regression 
coefficients for each amino acid substitution were calculated as shown for thermal 
tolerance in Fig. 19. The changes calculated to contribute positively to each property 
are shown in Fig. 21. 

The difference between beneficial amino acids selected by the three different 
representative assays is striking. Use of any of these measurements as the primary 
assay would select some amino acid changes that are not important for the others. 
These would be false positives. For example, use of casein hydrolysis as a primary 
screen would identify six changes (S107D, S123A, V167I, Y194S, A199S and 
S273T) that have a negative effect on activity towards AAPL-p-NA, with or without 
heating. Perhaps even more importantly, the casein primary screen would have 
falsely attributed a negative value to three of the four changes important for thermal 
tolerance (K208H, V267I and G293A). 

This failure of a tiered screening strategy is not simply a result of selecting an 
inappropriate surrogate substrate. Similar results would have been seen had activity 
towards AAPL-p-NA been used as a primary screen followed by a test for thermal 
tolerance. In this case half of the beneficial changes would still have been discarded 
as false negatives (K208H and V267I). This analysis shows that measuring properties 
that are different from those of the final application can result both in incorporation of 
sequence changes that do not contribute to the desired phenotype, as well as omission 
of those that do. 
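
The comparison between a primary screen and the final property reduces to two set differences. In the sketch below, the four substitutions beneficial for thermal tolerance are those named in the text. The set of substitutions retained by the AAPL-p-NA primary screen is deliberately partial: it contains only the two beneficial changes that the text implies are not discarded, and it is an inference rather than a listing of Fig. 21.

```python
def compare_screens(primary_hits, final_hits):
    """Classify primary-screen selections against the final property."""
    primary, final = set(primary_hits), set(final_hits)
    return {
        "false_positives": primary - final,  # selected, but not beneficial
        "false_negatives": final - primary,  # beneficial, but discarded
    }


# Changes predicted to benefit thermal tolerance, as named in the text.
thermal_tolerance_hits = {"K208H", "V267I", "G293A", "K332R"}

# Partial, inferred list of changes retained by the AAPL-p-NA primary screen.
aapl_ph7_hits = {"G293A", "K332R"}

result = compare_screens(aapl_ph7_hits, thermal_tolerance_hits)
print(sorted(result["false_negatives"]))
```

Because the primary-screen set here is incomplete, the false-positive set it yields is empty and should not be read as a statement about Fig. 21; only the false negatives correspond to the case described above.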

A method for engineering proteins based on design, synthesis and testing of 
small numbers of individual variants followed by mathematical modeling to 
determine a sequence-activity relationship has been described. Sequence-activity 
models that can be used predictively to design improved variants have also been 
described. 

By incorporating the principles of experimental design, individual design and 
synthesis of sequence variants allows a more efficient search of sequence space than a 
library approach (Hellberg et al. (1991) Int J Pept Protein Res 37: 414-424). Another 
advantage of the modeling approach is that it facilitates empirical protein engineering 
but requires only very low numbers of variants to be tested. This means that the need 
for high throughput screens is obviated. This analysis indicates that high throughput 
and tiered screening can be fundamentally flawed strategies for protein engineering. 
Both conserved reaction conditions and use of the same substrate appear susceptible 
to selection of false positives and rejection of false negatives. The performance of 
high throughput screens will be further compromised when the primary screen is 
selected on the basis of throughput rather than faithful replication of the final
application. 

6.2 IDENTIFYING A SET OF SUBSTITUTIONS AND DEFINING A SET OF
VARIANTS REPRESENTING THAT SEQUENCE SPACE FOR ANTIBODIES
WITH IMPROVED NEUTRALIZATION OF RESPIRATORY SYNCYTIAL
VIRUS

In this example, the optimization procedures of the present invention are 
illustrated for an antibody that binds and neutralizes Respiratory Syncytial Virus 
(RSV). The sequence of one such antibody is publicly available (Genbank accession 
# AAF21612). A significant benefit of the computational antibody design system 
using the methods described in this invention is that only relatively small numbers of 
variants need to be synthesized and tested. This allows the use of functional tests that 
are more comprehensive than binding assays. Viral neutralization, for example, is an
important antibody function, but its sequence and structural determinants are poorly
understood.

Methods used to identify substitutions in the framework and CDR regions of
the heavy chain of the AAF21612 antibody sequence are as follows. The sequence of
the heavy chain of the AAF21612 antibody was aligned, using the Kabat numbering
system, with germline human Ig heavy chain sequences retrieved from the VBase
database. A total of 49 sequences were aligned. This alignment need not be limited to
germline human sequences. Alternatively, all sequences that are in the same structural
class as AAF21612, as defined by Chothia and Lesk (Chothia and Lesk, 1986, EMBO
Journal 5, 823-826), can be used.

These 49 sequences were processed and substitutions scored according to a 
modified version of the scheme shown in Fig. 3. The modified process is shown in 
Fig. 22. 

Rule 1a. Align the sequences using Kabat numbering and select all
substitutions found in any of the germline sequences. Classify the substitutions into
two categories: (i) substitutions found in the framework region and (ii) substitutions
found in the CDR.

Rule 1b. Reconstruct a phylogenetic tree using the Clustal W software based
on the amino acid alignment in the framework region. For each substitution, calculate
the evolutionary proximity of the closest germline in which that substitution occurs.
The evolutionary proximity EP is calculated as follows:

$$p = m/n$$

where,

p is the p-distance,
m is the number of amino acid differences between two sequences; and
n is the total number of amino acids in the protein.

Further,

$$d = -\ln(1 - p)$$

where,

d is the Poisson-corrected p-distance between two sequences; and
ln(1 - p) is the natural logarithm of (1 - p).

And,

$$EP = 1/d$$

where,

EP is the evolutionary proximity.
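The three formulas of Rule 1b can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two sequences are assumed to be already aligned to equal length, and gap handling (which the text does not specify) is ignored. Identical sequences give d = 0, for which EP is undefined; the sketch returns infinity in that case as one possible convention.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """p = m/n: fraction of aligned positions at which the sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    m = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return m / len(seq_a)


def poisson_distance(p: float) -> float:
    """d = -ln(1 - p): Poisson-corrected distance, defined for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


def evolutionary_proximity(seq_a: str, seq_b: str) -> float:
    """EP = 1/d; returns infinity for identical sequences (assumed convention)."""
    d = poisson_distance(p_distance(seq_a, seq_b))
    return math.inf if d == 0 else 1.0 / d
```

For two four-residue sequences that differ at one position, p = 0.25, d = -ln(0.75), and EP is roughly 3.48.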



Rule 1c. For each substitution in the framework group and in the CDR,
calculate the favorability of that substitution using a PAM100 matrix.

$$SM = \mathrm{PAM}(A_o, A_s)/10$$

where,

$A_o$ is the original amino acid at a position,
$A_s$ is the substitution amino acid, and
$\mathrm{PAM}(A_o, A_s)$ is a measure of the average probability that $A_o$ is
substituted with $A_s$ in a large set of protein homolog families.


Rule 2b. For each position, calculate the site heterogeneity, that is a measure 
of the number of different amino acids present at that position. The site heterogeneity 
is calculated as the number of different amino acids seen at a position in the set of 
homologs (SH). 

Rule 3b. For each position, calculate the site entropy as follows:

$$SE = -\sum_{i} (P_{Ai}/N) \times \ln(P_{Ai}/N)$$

where,

N is the number of homologous sequences,
$P_{Ai}$ is the number of times amino acid i occurs at position P,
$\ln(P_{Ai}/N)$ is the natural log of $P_{Ai}/N$, and
$\sum$ is the sum over all amino acids for position P.

Rule 4b. For each substitution, count the number of times it occurs in the set of
homologs (SN).
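Rules 2b, 3b and 4b all derive from the residues observed in one alignment column, so they can be sketched together. The function name and the representation of a column as a plain string (one residue per homolog) are assumptions made for illustration only; gap characters are not treated specially.

```python
import math
from collections import Counter


def site_statistics(column: str):
    """Return (SH, SE, counts) for one alignment column.

    SH     - site heterogeneity: number of different amino acids seen.
    SE     - site entropy: -sum((P_Ai/N) * ln(P_Ai/N)) over observed amino acids.
    counts - occurrences of each amino acid; counts[aa] is SN for substitution aa.
    """
    if not column:
        raise ValueError("column must contain at least one residue")
    counts = Counter(column)
    n = len(column)
    sh = len(counts)
    se = -sum((c / n) * math.log(c / n) for c in counts.values())
    return sh, se, counts
```

For a column "AAAS" from four homologs, SH is 2, SE is about 0.562, and SN for serine is 1.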

The total score is then calculated for framework and CDR region substitutions
as follows:

$$\mathrm{Score}_{FW} = f(EP) \times f(SH) \times f(SE) \times f(SN) \times f(SM)$$

where f() is a mathematical function. In this case the function was the parameter in
the parentheses multiplied by 1, but the use of functions allows different weights to be
applied in subsequent cycles.

$$\mathrm{Score}_{CDR} = f(SE) \times f(SN) \times f(SM)$$

where f() is a mathematical function. In this case the function f() was the parameter
in the parentheses multiplied by 1, but the use of functions f() allows different weights
to be applied in subsequent cycles.
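The product form of the two scores, with a replaceable weighting function per term, might be sketched as follows. The names `combined_score` and `identity`, and the use of a dictionary keyed by term name, are hypothetical choices for illustration; the text specifies only that each f() was multiplication by 1 in this cycle.

```python
from typing import Callable, Mapping, Optional


def identity(x: float) -> float:
    """The weighting used in this example: the parameter multiplied by 1."""
    return 1 * x


def combined_score(terms: Mapping[str, float],
                   weights: Optional[Mapping[str, Callable[[float], float]]] = None) -> float:
    """Multiply f(term) over all supplied terms.

    Pass EP, SH, SE, SN and SM for a framework substitution,
    or SE, SN and SM for a CDR substitution.
    """
    weights = weights or {}
    score = 1.0
    for name, value in terms.items():
        score *= weights.get(name, identity)(value)
    return score
```

A later cycle could, for example, pass `weights={"SE": lambda x: x ** 2}` to emphasize site entropy without changing the rest of the calculation.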
Based on the above scores, twenty substitutions in both the CDR and
framework were identified. The results from use of this substitution-scoring scheme
are shown in Table 1:



Table 1

Framework substitution   Score          CDR substitution   Score
K78R                     0.465651       V30M               35.63365
H73Q                     0.398614       D65N               35.55048
S79A                     0.389751       G51S               32.24937
L08V                     [illegible]    N31S               30.06633
S24T                     0.345752       L52aY              30.05984
L01V                     0.338391       L52aN              9.380159
S20A                     0.337918       N31H               25.66902
G26S                     0.337206       E56K               25.53363
D27S                     0.333916       D65T               22.22917
V45I                     0.321903       F33V               21.71887
L42V                     0.311439       A53D               21.88011
C19A                     0.280519       A53P               19.17291
S68N                     0.279479       A53Q               12.47777
M74L                     0.258614       F59V               16.86972
N75S                     0.254877       V55L               16.06146
I69T                     0.243678       G51N               13.5927
T21S                     0.238389       E56Q               11.17192
R13S                     0.227712       V55F               10.78488
V86R                     0.221026       F33I               9.900269
G85A                     0.21849        S62T               6.950517



A set of forty variants was then designed according to the following criteria:

1. Include five substitutions in each variant.

2. Maximize the number of different pairs of substitutions that occur. If each
variant contains five substitutions, it contains ten pairs. There is thus a
maximum of 400 pairs represented in forty variants. The variant set below was
optimally designed to maximize the number of pairs observed.

In addition, the relative number of framework versus CDR substitutions can be
modulated. A maximum number of framework and/or CDR substitutions in a variant
can be set.

This set was calculated by in silico evolution. An initial set of variants each
containing five substitutions was randomly chosen. Substitutions were then altered
randomly. If a change increased the number of substitution pairs in the variant set it
was accepted. Otherwise it was rejected. The process continued for 10,000 iterations.
The final set of variants is shown in Table 2.
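The in silico evolution just described amounts to a random-mutation hill climb on the number of distinct substitution pairs. A minimal sketch follows, under stated assumptions: the function names, the random seed, and the choice to mutate one substitution in one variant per iteration are illustrative and are not taken from the text. The sketch also does not prevent two substitutions at the same position (for example A53D and A53P) from landing in one variant, nor does it cap framework or CDR counts. With forty candidate substitutions there are 40 x 39 / 2 = 780 possible pairs, of which at most 40 x 10 = 400 can be covered by forty five-substitution variants.

```python
import random
from itertools import combinations


def count_pairs(variants):
    """Number of distinct substitution pairs that co-occur in at least one variant."""
    pairs = set()
    for variant in variants:
        pairs.update(combinations(sorted(variant), 2))
    return len(pairs)


def evolve_variant_set(substitutions, n_variants=40, per_variant=5,
                       iterations=10_000, seed=0):
    """Hill-climb a randomly initialized variant set toward maximal pair coverage."""
    pool = list(substitutions)
    rng = random.Random(seed)
    # Initial set: each variant is a random draw of `per_variant` substitutions.
    variants = [frozenset(rng.sample(pool, per_variant)) for _ in range(n_variants)]
    best = count_pairs(variants)
    for _ in range(iterations):
        # Randomly alter one substitution in one randomly chosen variant.
        i = rng.randrange(n_variants)
        removed = rng.choice(sorted(variants[i]))
        added = rng.choice([s for s in pool if s not in variants[i]])
        mutated = (variants[i] - {removed}) | {added}
        trial = variants[:i] + [mutated] + variants[i + 1:]
        score = count_pairs(trial)
        # Accept only changes that increase the number of pairs; otherwise reject.
        if score > best:
            variants, best = trial, score
    return variants, best
```

Calling `evolve_variant_set(pool)` with a list of the forty substitution names from Table 1 returns forty variants and the number of distinct pairs they cover.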
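Table 2

Variant-1    [illegible]  [illegible]  [illegible]  T21S         F59V
Variant-2    [illegible]  [illegible]  [illegible]  S68N         G85A
Variant-3    [illegible]  [illegible]  L01V         R13S         G85A
Variant-4    [illegible]  [illegible]  [illegible]  F59V         V55F
Variant-5    [illegible]  [illegible]  [illegible]  A53D         V86R
Variant-6    [illegible]  [illegible]  R13S         V30M         A53P
Variant-7    [illegible]  M74L         [illegible]  G85A         G51N
Variant-8    [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-9    [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-10   [illegible]  M74L         [illegible]  [illegible]  [illegible]
Variant-11   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-12   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-13   [illegible]  [illegible]  [illegible]  L52aY        [illegible]
Variant-14   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-15   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-16   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-17   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-18   V45I         [illegible]  [illegible]  [illegible]  [illegible]
Variant-19   [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Variant-20   K78R         [illegible]  [illegible]  [illegible]  [illegible]
Variant-21   S79A         L42V         [illegible]  [illegible]  [illegible]
Variant-22   M74L         [illegible]  [illegible]  [illegible]  [illegible]
Variant-23   S24T         L01V         [illegible]  [illegible]  [illegible]
Variant-24   V45I         L42V         [illegible]  [illegible]  [illegible]
Variant-25   L42V         I69T         [illegible]  [illegible]  [illegible]
Variant-26   S20A         I69T         [illegible]  [illegible]  [illegible]
Variant-27   G26S         [illegible]  [illegible]  [illegible]  [illegible]
Variant-28   C19A         [illegible]  [illegible]  [illegible]  [illegible]
Variant-29   H73Q         L08V         [illegible]  [illegible]  [illegible]
Variant-30   K78R         [illegible]  [illegible]  [illegible]  [illegible]
Variant-31   S20A         [illegible]  [illegible]  [illegible]  [illegible]
Variant-32   S79A         S24T         S68N         T21S         A53D
Variant-33   L42V         R13S         D65N         V55F         [illegible]
Variant-34   D27S         G85A         G51S         L52aN        N31H
Variant-35   N75S         T21S         F33V         A53P         S62T
Variant-36   R13S         L52aY        F33V         V55L         E56Q
Variant-37   [illegible]  V45I         S68N         V55F         S62T
Variant-38   L08V         S24T         C19A         V30M         E56Q
Variant-39   S79A         D27S         C19A         M74L         N31S
Variant-40   H73Q         S24T         D27S         V86R         D65N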



6.3 IDENTIFYING A SET OF SUBSTITUTIONS AND DEFINING A SET OF
VARIANTS REPRESENTING THAT SEQUENCE SPACE FOR
HUMANIZING AND OPTIMIZING MURINE ANTIBODIES FOR
NEUTRALIZATION OF RESPIRATORY SYNCYTIAL VIRUS



In this example, a humanization procedure for a murine antibody, RSV19, that
binds and neutralizes RSV (Respiratory Syncytial Virus) is illustrated. A significant
benefit of the computational antibody design system using the methods described in
this invention is that only small numbers of variants need to be synthesized and tested.
This allows the use of functional tests that are more complicated than selection for
binding. Antibody humanization is an important antibody engineering goal, but the
sequence and structural determinants are poorly understood.

The methods used to identify substitutions in the framework and CDR regions
of the heavy chain of the RSV-19 antibody sequence are as follows. The sequence of
the heavy chain of the RSV-19 antibody was aligned, using the Kabat numbering
system, with germline human Ig heavy chain sequences retrieved from the VBase database.
This alignment need not be limited to germline human sequences. Alternatively, all
human antibody sequences that are in the same structural class as AAF21612, as
defined by Chothia and Lesk (Chothia and Lesk, 1986, EMBO Journal 5, 823-826),
can be used. A total of 45 sequences were aligned.

The sequences were processed and substitutions scored according to a 
modified version of the scheme shown in Fig. 3. The modified process is shown in 
Fig. 23. 

Rule 1a. Align sequences using Kabat numbering and select all substitutions
found in any of the germline sequences. Classify the substitutions into two
categories: (i) substitutions found in the framework region and (ii) substitutions found
in the CDR. Select only these substitutions and consider them separately.

Rule 1b. Reconstruct a phylogenetic tree using the Clustal W software based
on the amino acid alignment in the framework region. For each substitution, calculate
the evolutionary proximity of the closest germline in which that substitution occurs.
The evolutionary proximity (EP) is calculated, where EP is as defined in Section 6.2.

Rule 1c. For each substitution in the framework group and in the CDR,
calculate the favorability of that substitution using a PAM100 matrix. SM is as
defined in Section 6.2.

Rule 2b. For each position, calculate the site heterogeneity, that is, a measure
of the number of different amino acids present at that position. The site heterogeneity
is calculated as the number of different amino acids seen at a position in the set of
homologs (SH).

Rule 3b. For each position, calculate the site entropy SE using the algorithm
described in Section 6.2.

Rule 4b. For each substitution, count the number of times it occurs in the set
of homologs (SN).

The total score is then calculated for framework and CDR region substitutions
as follows:

$$\mathrm{Score}_{FW} = f(EP) \times f(SH) \times f(SE) \times f(SN) \times f(SM)$$

where f() is a mathematical function. In this case the function was the
parameter in the parentheses multiplied by 1, but the use of functions allows different
weights to be applied in subsequent cycles.

$$\mathrm{Score}_{CDR} = f(SE) \times f(SN) \times f(SM)$$

where f() is a mathematical function. In this case the function was the
parameter in the parentheses multiplied by 1, but the use of functions allows different
weights to be applied in subsequent cycles.

Based on the above scores, twenty substitutions in both the CDR and the
framework were identified. The results of using this substitution-scoring scheme are
shown in Table 3:
Table 3

Framework substitution   Score       CDR substitution   Score
[illegible]              0.389716    D31S               30.52868
K19R                     0.364451    V53T               28.91288
D69N                     0.330972    D52cS              28.58028
[illegible]              0.314539    N52aS              26.10288
T82aA                    0.304669    M35bV              25.5501
I29F                     0.275096    K30S               24.93946
N73T                     0.270393    Q54Y               23.36634
S71K                     0.268009    Q60K               23.19751
T70S                     0.264867    D52bG              22.92382
A16G                     0.262967    H35S               21.86205
A85V                     0.261951    E52D               20.6664
S72N                     0.258769    D50S               20.49003
T66S                     0.253018    M65I               20.19161
T23A                     0.2495      N52aG              20.19023
N90R                     0.249449    A63V               19.17104
A67R                     0.24173     K58S               18.77169
Q41K                     0.22512     E52S               18.50896
D69T                     0.218449    F59V               18.24618
N28T                     0.217729    D52cN              17.85268
R38A                     0.215293    P57D               17.60608



A set of forty variants was then designed according to the following criteria:

1. Include four to six substitutions in each variant.

2. Maximize the number of different pairs of substitutions that occur. If each
variant contains five substitutions, it contains ten pairs. There is thus a
maximum of 400 pairs represented in forty variants. The variant set below was
optimally designed using the evolutionary algorithm to maximize the number of pairs
observed.

In addition, the relative number of framework versus CDR substitutions can be
modulated. A maximum number of framework and/or CDR substitutions in a variant
can be set. For humanization, substitutions of human residues in framework regions
are preferred. Substitutions in the CDR are designed to retain the activity while
changing the amino acids in the framework region to be more biased towards human
sequences.

This set was calculated by in silico evolution. An initial set of variants each
containing five substitutions was chosen randomly. Substitutions were then altered
randomly. If a change increased the number of substitution pairs in the variant set it
was accepted. Otherwise it was rejected. The process continued for 10,000 iterations.
The final set of variants is shown in Table 4.
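The cap on framework and/or CDR substitutions per variant could be applied as a simple filter on each candidate variant before it is accepted during the evolution. The sketch below is an assumption about how such a check might look; the function name and the `max_framework` and `max_cdr` parameters are hypothetical, and the text does not state which limits were used.

```python
def within_limits(variant, framework, cdr, max_framework=None, max_cdr=None):
    """Return True if the variant respects the optional per-region caps.

    variant   - iterable of substitution names in one variant
    framework - set of substitution names located in framework regions
    cdr       - set of substitution names located in CDRs
    A cap of None means that region is unconstrained.
    """
    n_framework = sum(1 for s in variant if s in framework)
    n_cdr = sum(1 for s in variant if s in cdr)
    if max_framework is not None and n_framework > max_framework:
        return False
    if max_cdr is not None and n_cdr > max_cdr:
        return False
    return True
```

For instance, a variant with four framework substitutions and one CDR substitution passes a limit of `max_cdr=1` but fails a limit of `max_framework=3`.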
Table 4

Variant-1    I29F    N73T    S71K    Q41K    N52aG
Variant-2    N73T    A85V    S72N    T66S    R38A
Variant-3    D69N    R13K    I29F    D69T    R38A
Variant-4    D69N    N90R    D31S    N52aG   F59V
Variant-5    N52aS   Q54Y    Q60K    H35S    E52S
Variant-6    K19R    T66S    D69T    D31S    D50S
Variant-7    I46V    T23A    N28T    R38A    K58S
Variant-8    R13K    N73T    K30S    K58S    D52cN
Variant-9    A16G    S72N    V53T    K30S    D52bG
Variant-10   S71K    T23A    N90R    D69T    M65I
Variant-11   I29F    N28T    K30S    H35S    A63V
Variant-12   A85V    N52aS   M35bV   K30S    D31S
Variant-13   R13K    D52bG   H35S    D50S    M65I
Variant-14   T66S    D52cS   N52aG   A63V    D50S
Variant-15   I46V    K19R    D69N    M35bV   D52cN
Variant-16   S71K    T70S    A16G    K58S    E52S
Variant-17   I46V    S72N    N90R    A67R    Q54Y
Variant-18   A16G    Q41K    R38A    D31S    A63V
Variant-19   I46V    N73T    V53T    D52cS   E52S
Variant-20   I46V    T70S    D52bG   E52D    P57D
Variant-21   D69N    A85V    M65I    A63V    K58S
Variant-22   T23A    A67R    D52bG   E52S    D52cN
Variant-23   T82aA   I29F    A67R    D52cS   D50S
Variant-24   A16G    A85V    T23A    Q54Y    D50S
Variant-25   A85V    A67R    Q41K    N28T    Q60K
Variant-26   N73T    A67R    D31S    N52aS   E52D
Variant-27   S71K    T66S    M35bV   Q60K    D52bG
Variant-28   S72N    N28T    E52D    M65I    N52aG
Variant-29   K19R    R13K    Q54Y    A63V    P57D
Variant-30   I46V    R13K    S71K    N52aS   F59V
Variant-31   N73T    T70S    Q60K    M65I    F59V
Variant-32   D69N    T82aA   T66S    Q41K    H35S
Variant-33   A85V    D69T    V53T    F59V    D52cN
Variant-34   T70S    R38A    D52cS   K30S    Q54Y
Variant-35   N90R    Q41K    E52D    D50S    D52cN
Variant-36   D69T    M35bV   E52D    A63V    P57D
Variant-37   I29F    A16G    T66S    F59V    P57D
Variant-38   R13K    T82aA   S72N    D31S    E52S
Variant-39   D69N    T70S    S72N    T23A    N52aS
Variant-40   K19R    T82aA   T70S    N28T    V53T




7. REFERENCES CITED 

All references cited herein are incorporated herein by reference in their
entirety and for all purposes to the same extent as if each individual publication or
patent or patent application was specifically and individually indicated to be
incorporated by reference in its entirety for all purposes.

Aspects of the present invention can be implemented as a computer program
product that comprises a computer program mechanism embedded in a computer 
readable storage medium. For instance, the computer program product could contain 
the program modules and/or data structures shown in Fig. 2. These program modules 
may be stored on a CD-ROM, magnetic disk storage product, digital video disk 
(DVD) or any other computer readable data or program storage product. The 
software modules in the computer program product may also be distributed 
electronically, via the Internet or otherwise, by transmission of a computer data signal 
(in which the software modules are embedded) on a carrier wave. 

Many modifications and variations of this invention can be made without 
departing from its spirit and scope, as will be apparent to those skilled in the art. The 
specific embodiments described herein are offered by way of example only, and the 
invention is to be limited only by the terms of the appended claims, along with the full 
scope of equivalents to which such claims are entitled.